The Science behind new/s/leak II: Interactive Visualization

We already explained the language-related data wrangling happening under new/s/leak’s hood. For the success of new/s/leak, our second scientific field is the game changer: interactive visualization. No matter how accurate the information we produce is – if we cannot present it to the user in an appealing way, the tool will fail its goals. So how exactly is visualization science influencing new/s/leak?

Your daily dose of visualization science

It might seem easy to create some kind of visualization (with Excel or even pen and paper) – however, there are lots of pitfalls you need to avoid to create good visualizations. You might take some of them for granted because you encounter them every day when browsing the web – but they’d be painfully missed otherwise. Here are two examples which you can find in many applications and websites (and which we of course also consider for new/s/leak):

  • Animation speed: there is a certain animation speed that is pleasant to look at, and it cannot be much faster or slower if we want to convey information. While it’s intuitively clear that reaaallyyyy slooooowwww animations can be annoying (think of those endless progress bars…), going too fast can overload you, even if there is no critical information involved. For evidence, take a minute and look at those two guys working out (they are doing exactly the same movement).

    Now answer the question: would you pick the guy on the right as your office mate? That might become exhausting really soon (even if he’s really smart).
  • Colors: if you use colors for information visualization, you cannot just use any color set you find appealing. While there is scientific work on which color sets are good for which information types, we also need to think about color-blind people. On many web pages, the color sets are already designed to be suitable for different types of color perception. See how the new/s/leak logo would look to color-blind people (created with Coblis):

Those are just two examples from a whole lot of guidelines that visualization scientists have developed and that we encounter in every good visualization. Of course new/s/leak has to follow all of those guidelines – which becomes harder with more complex data, and at scale. Which brings us to the next important point:


Accurate views on loads of complex data

This happens with too much information displayed at once

The largest challenge (and thus the largest need) for visualization science and language technology alike comes with the huge amounts of data we have to handle. A leak can be anything from 100 documents to 1 TB (or more). This is not only a lot for research-based software, but also enough to break many commercial applications. So, this is where the action is.
Visualizing data for investigative purposes means that the software must not show anything untrue (nothing shady or overly ambiguous, either). However, there is so much data to display (all the documents, their metadata, and all the entities we extracted) that we simply cannot show the whole truth on one screen – that would be a) impossible and b) completely unusable. Imagine a network that shows all the information it has – it quickly becomes a “hairball” like in the picture on the left.

Because new/s/leak should be intuitively accessible for users with different backgrounds and without much training, we need to find easy interfaces for giant piles of complex data. This excludes e.g. some very powerful but rather complicated interfaces used for search in scientific environments.

One extreme way to tackle this is the way Google presents the internet to us, or rather the 60 trillion pages it has indexed: initially, Google doesn’t display anything at all, but rather lets the user explore (see comic on the right). While we also allow users to explore the data on their own, new/s/leak’s main purpose is to guide them through the data jungle and to provide a concise graphical summary of the core plots. In consequence, we display an initial graph as entry point that contains the most interesting entities and relations. It’s a scientific question of its own what “most interesting” means, but according to our users, frequency is a good indicator here. We thus initially show the most frequent people, companies and places, and let the user explore from there. (If they wish, users can also pick the least frequent entities first – this might foster serendipity.)
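The frequency-based entry point described above can be sketched in a few lines. This is only a toy illustration: the entity mentions are invented, and new/s/leak’s actual ranking lives in its backend.

```python
from collections import Counter

# Hypothetical extracted entity mentions: (name, type) pairs, one per mention.
# In new/s/leak these would come from the language processing pipeline.
mentions = [
    ("Apple", "ORG"), ("Samsung", "ORG"), ("Apple", "ORG"),
    ("Tim Cook", "PER"), ("Cupertino", "LOC"), ("Apple", "ORG"),
    ("Samsung", "ORG"), ("Seoul", "LOC"),
]

def initial_entities(mentions, k=3, least_frequent=False):
    """Pick the k most (or least) frequent entities for the entry-point graph."""
    ranked = Counter(mentions).most_common()  # sorted by frequency, descending
    if least_frequent:
        ranked = ranked[::-1]                 # serendipity mode: rarest first
    return [entity for entity, _ in ranked[:k]]

# Apple (3 mentions) and Samsung (2) make up the initial view;
# ties among the single-mention entities are broken arbitrarily.
print(initial_entities(mentions))
```

The `least_frequent` flag corresponds to the serendipity option mentioned above: the same ranking, merely reversed.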


Knowing our users

Coming back to the usability comic – of course we have to design new/s/leak for its users. That’s what Apple and Google do, too: providing a simple one-size-fits-all screen as an entry point. While we have a smaller user group, we also have an application with more interaction possibilities. So the key to everything lies in knowing our users, in order to find the right mixture between the Google window (with too little functionality) and an overloaded brick puzzle (which is ugly and unusable).

User studies are, in fact, an important pillar of visualization science. We already told you about early requirements management. It might sound trivial that we ask people what they like and dislike, and then change the system accordingly – but this is no matter of course at all (see e.g. some excuses why companies don’t do user research). One of the reasons is that it actually takes lots of experience to design a good user study: one that assesses all the information needed without influencing the user, and that allows us to generate meaningful hypotheses for interface changes. Further, it simply takes quite some time to run such studies repeatedly. Fortunately, our visualization scientists are experts in user studies, and they are dedicated to testing our system frequently and in a way that allows for objective, comprehensive evaluation.

Paper accepted @ SKILL 2016

Franziska, who designed and documented our requirements analysis for her research seminar paper (Studienarbeit), turned this into an article accepted for SKILL 2016, titled “new/s/leak – Anforderungsanalyse einer interaktiven Visualisierung für Data-Driven Journalism” (in German – English translation: “new/s/leak – Requirements analysis of an interactive visualization for data driven journalism”).

Congrats Franziska, and of course thanks for the excellent analysis!

Paper accepted @ ACL 2016

Our paper “new/s/leak – Information Extraction and Visualization for Investigative Data Journalists” was accepted for the Systems Demonstration Track at ACL 2016!

ACL is the most important conference for computational linguistics, and we will present new/s/leak there. Stay tuned for the final paper version – and our publicly available prototype.

Looking forward to meeting you in Berlin!

new/s/leak’s impact on science

There are many reasons why data journalism needs new scientific approaches, and we have discussed some of them at length. So far we haven’t talked much about the reverse claim, which is, however, equally true: this journalistic project also advances science.
So why are our scientists so passionate about the new/s/leak project? And what kind of scientific challenges do we face?


/S/cience on a Mission

All our scientists agree that it’s an invaluable experience to work on real use cases, solving real-world problems and collecting extensive feedback from real users.

Science on a mission to help journalists

Learning how journalists work and researching new ways to help them would already be exciting enough on its own – but investigative journalism is much more than just an intriguing application scenario: journalistic work has essential social impact, and our software will help to increase transparency not only for journalists, but also for everyone out there who reads, watches or listens to their stories.


/S/caling up

Scaling up to big data language processing and visualization

Genuine use cases come with genuine research challenges: for both the visual and the backend part of new/s/leak, we need to turn scientific prototypes into a scalable, user-friendly, big-data-proof application.

On the backend side, we thus need techniques to speed up data processing without sacrificing quality, for which we also need lots of engineering with new frameworks and tools.

Of course we also need a way to keep the actual user interface clean and responsive, regardless of the (possibly huge) amount of data behind it. This is a core challenge tackled from the visualization side.


/S/ubstantial Interactivity

We need to integrate all kinds of user interaction

Interaction design is a challenge for new/s/leak in many ways: first of all, we have our visualization scientists devoted to the challenge of finding a smart interface that allows for intuitive user interaction. On the backend side, we need to integrate user interactions into the language processing pipeline (see our Requirements Analysis), because we want to enable users to define entities.

We also need to create possibilities for the users to interact collaboratively within the newsroom.

And, of course, we need to design our own interaction process for the interdisciplinary development of frontend and backend, and we need to translate between journalists and scientists. As with almost all projects, the things that make new/s/leak more exciting also bring more challenges.


/S/uccess Indicators

Scientists like to measure success in reproducible numbers. For example, we could rate an algorithm for Entity Recognition by counting how many entities it recognizes correctly, using a text in which all entities were marked by human experts. This is great because you can compare different systems, and you can track the progress of your own approach. From a scientist’s point of view, we’d strive for such an evaluation strategy for new/s/leak, too – but it’s not all that easy.

We need to develop user-centric evaluation methods for scientific methods

We cannot just count how often new/s/leak shows something relevant, because a single fact (or even sentence) of a leak is hardly ever relevant on its own. We cannot reverse-engineer this problem either: an article based on a leak has no particular list of text snippets that constitutes all the information contained in the story. Rather than counting and comparing, we take an approach which our experts for graphical interactive systems also use regularly: we conduct user studies (and already have), asking questions that allow us to quantify success without using exact measures. Like the whole project, the definition of success needs to be scientifically grounded, but entirely user-centric.

For new/s/leak’s scientists (and also the journalists, of course), this project will be successful if the software will help many users to discover information that matters. And all of us hope that the work on new/s/leak will be sustainably continued in follow-up projects.

The Science behind new/s/leak I: Language Technology

Because of the Easter holiday season and several conference deadlines, this blog had to take a little break. Being back, we want to give a glimpse of the science behind new/s/leak.

We have two camps of scientists working together: computational linguists, who contribute software that extracts semantic knowledge from texts, and visualization experts, who bring the results into smart interactive interfaces that are easy for journalists to use (after the computational linguists have made the dataset even more complicated than before).

In this post, we will explain some of the semantic technology that helps to answer the questions “Who does what to whom – and when and where?”. The visualization science will be covered in a later feature.

Enriching texts with names and events

The results of the language technology software are easy to explain: we feed all the texts we want to analyze into several smart algorithms, and those algorithms find names of people, companies and places, and they spot dates. On top of those key elements (or “entities”), we then extract the relationships between them, e.g. some person talks about a place, leaves a company, or pays another person. Finally, we are ready to put all of this into a nice network visualization.


Entity and Relation Extraction for new/s/leak

We hope that you’re not ready to accept that all of this simply happens by computational magic, so let’s dig a bit deeper:

(Disclaimer: This is not a scientifically accurate explanation, but rather a very brief high-level illustration of some science-based concepts.)

Identifying names – 🍎 vs. 

Identifying words that name people, organizations and so on is not as easy as it might sound. (In Computational Linguistics, this task is called Named Entity Recognition, NER for short.)

Just looking through a big dictionary of names works sometimes, but many names can also be things, like Stone (that can be Emma Stone or a rock) or Apple (which can be food or those people who are selling the smartphones). Within a sentence, however, it’s almost always clear which one is meant (at least to humans):

“Apple sues Samsung.” clearly means the company, whereas

“Apple pie is really delicious.”

probably means the fruit. The examples also show that just checking for upper or lower case is not sufficient, either.

What the algorithms do instead is first decide whether a word is a name at all (as in the  case), or rather some common noun (that’s the 🍎 case). There are two factors that decide this: first, how likely the string “apple” is to be a name, no matter in which context. (Just to put some numbers in, say the word apple has a 60% likelihood of being a company, and 40% of being a noun.) Additionally, the algorithm checks the likelihood of having a name in the given context. (Again, with exemplary numbers: any word, no matter which one, at the beginning of a sentence followed by a verb, has a likelihood of 12% of being a name; followed by a noun, the likelihood is 8%, and so on.)

With this kind of information, the NER algorithm decides whether, in the given sentence, Apple is most likely to be a name (or something else).

In the final step, the algorithm uses similar methods to decide whether the name is more likely to belong to a person, a company or a place.
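The two-factor decision above can be sketched as a tiny probabilistic score. The numbers are the invented illustration figures from the text, not real model parameters, and real NER taggers are far more sophisticated than this.

```python
word_is_name = {            # how likely the string itself is a name
    "apple": 0.60,          # 60% company, 40% common noun (made-up numbers)
    "pie":   0.01,
}

context_has_name = {        # how likely this context position holds a name
    "sentence-start+verb": 0.12,   # "Apple sues ..."
    "sentence-start+noun": 0.08,   # "Apple pie ..."
}

def name_score(word, context):
    # Naively combine both evidence sources by multiplying them
    # (an independence assumption, in the spirit of naive Bayes).
    return word_is_name[word.lower()] * context_has_name[context]

# "Apple sues Samsung." vs. "Apple pie is really delicious.":
print(name_score("Apple", "sentence-start+verb") >
      name_score("Apple", "sentence-start+noun"))  # the verb context favors a name
```

With these toy numbers, the verb context yields 0.60 × 0.12 against 0.60 × 0.08 for the noun context, so “Apple sues …” is tagged as a name.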

There are many different tools for named entity recognition; new/s/leak uses the Epic system.


Extracting dates

In principle, extracting dates (like “April 1st” or “2015-04-01”) works very similarly to extracting names. But often dates are incomplete – then we need more information: if we only find “April 1st” with no year given, we need some indicator of which year could be meant. In our case, the algorithm checks the publishing date of the document (which we almost always have for journalistic leaks) and defaults all missing years to the publishing year.
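The defaulting strategy can be illustrated in a few lines. This is only a sketch with two made-up patterns; new/s/leak actually uses Heideltime, which handles far more expression types.

```python
import re
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"])}

def extract_dates(text, publishing_date):
    """Toy date extractor: full ISO dates are taken as-is; partial dates like
    'April 1st' default to the document's publishing year."""
    found = []
    # Complete dates: YYYY-MM-DD
    for y, m, d in re.findall(r"(\d{4})-(\d{2})-(\d{2})", text):
        found.append(date(int(y), int(m), int(d)))
    # Partial dates: "April 1st" -> fill in the publishing year
    partial = r"(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?"
    for month, day in re.findall(partial, text):
        found.append(date(publishing_date.year, MONTHS[month], int(day)))
    return found

# The partial date inherits the year of the (hypothetical) publishing date:
print(extract_dates("Memo of April 1st; cf. 2015-04-01.", date(2016, 5, 20)))
```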

The extraction of time expressions in new/s/leak is done with the Heideltime tool.

Finding relations (or events)

Now that we have found that Apple and Samsung appear somewhere in our text collection, and that both are companies, we want to know whether or not they actually have some business together, and if so, how they are connected. The algorithms behind this do a very human-like thing: they read all the texts and check whether or not they find Apple and Samsung (as companies) in the same document, and if so, they try to find out whether there is some event (like “suing” in the sentence above) that connects the two directly. There might also be multiple such relations, or they might change over time – then we try to find the most relevant ones. Relevant events in our example are things mentioned frequently for Apple and Samsung, but rarely in other contexts. E.g. if we additionally find the sentence “Apple talks about Samsung” somewhere, talking would probably be less relevant than suing (from “Apple sues Samsung”), because talking shows up more often than suing and is not very specific to the Apple / Samsung story.
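The “frequent for this pair, rare elsewhere” intuition is essentially a tf-idf-style score. Here is a toy sketch with invented counts (the actual relevance computation in JTopia differs):

```python
import math

pair_counts   = {"sues": 4,  "talks": 5}    # mentions near Apple AND Samsung
corpus_counts = {"sues": 10, "talks": 500}  # mentions anywhere in the corpus

def relevance(event):
    tf = pair_counts[event]                  # frequent for the entity pair...
    idf = math.log(sum(corpus_counts.values()) / corpus_counts[event])
    return tf * idf                          # ...but rare in the corpus overall

# "talks" is more frequent in absolute terms, but "sues" is far more
# specific to the Apple / Samsung story, so it scores higher:
print(relevance("sues") > relevance("talks"))
```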

To find relations between entities, we use the same system employed in the Network of Days, together with relevance information computed by JTopia.

Now that we have all this information about people, organizations, times and places, the software of our visualization scientists can display it all in one interactive graph. This visualization science part will be covered in one of the next entries.

Requirements Management

User requirements management is something that happens far too rarely, especially in scientific software. (And it can definitely be challenging.)

For our project, which brings together the quite different worlds of science and journalism, and also different academic disciplines, it’s even more important. We dedicated a whole day to this, with Franziska and Kathrin over at SPIEGEL in Hamburg – and we proved that requirements analysis can be both challenging and fun at the same time.

Overall, Kathrin and Franziska interviewed four journalists from different newsrooms who showed the whole diversity of potential new/s/leak user groups.

New Priorities

Some of the journalists’ answers were interesting because they prioritized things we had considered nice to have, but not so important to the end user. So here are the top 3 surprising lessons learned:

  1. Metadata that comes with the documents is even more important than we thought. Our software thus should not just display some selected metadata features (like time and geolocations), but rather show everything we can extract from the data, including e.g. also data types and file sizes. (One showcase for the journalistic value of metadata is this feature about the Hacking Team Leak.)
  2. Source documents always have to be accessible. Our initial idea was to focus on the network of entities and to show the documents just on demand – but the journalists need a direct way to the original documents in each view, and then want to filter the documents by selecting certain entities, entity relations, time spans or other metadata.
  3. Networks are an utterly intuitive concept. Many concepts and figures from network theory (like centrality, connectedness, outliers…) have intuitive counterparts (“Who is in the center of all of this?”,”Who is best connected to whom?”, “Can I see who’s at the top of the communication hierarchy?”), and can provide crucial information. That’s good news, and that also means that we have to be even more flexible when computing the connections in the network.
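Concepts like centrality map directly onto simple graph computations. A stdlib-only sketch on an invented mini-network (the names and edges are hypothetical, not data from any leak):

```python
from collections import defaultdict

# Invented communication network between extracted entities.
edges = [
    ("Alice", "Bob"), ("Alice", "Carol"), ("Alice", "Dave"),
    ("Bob", "Carol"), ("Eve", "Dave"),
]

# "Who is in the center of all of this?" -> degree centrality:
# the share of all other nodes an entity is directly connected to.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

n = len(neighbors)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}
print(max(centrality, key=centrality.get))  # Alice: connected to 3 of 4 others
```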
Scribbling User Requirements

Drafting the next new/s/leak version after the interviews

User-Specific needs

Some functionality needs to be highly adaptable to meet the needs of different user groups and different working styles. The focus here is on two things:

  1. Powerful tagging functionality. We need to support free-text tagging, bookmarking and simple markers like “important” vs. “unimportant”. This allows users to create their own metadata.
  2. Transparency. Some users prefer precise results over extended functionality, other users (especially people working under time pressure) would sacrifice a bit of accuracy for more automated support to filter the data. To meet both needs, we will provide as much automated support as possible, but at the same time, we will clearly indicate what the machine generated, how confident we are about the machine’s result, and which part of the information is genuine (as in: was part of the source documents).
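The transparency requirement suggests that every displayed item should carry its provenance. A hypothetical record structure (the field names are our invention for illustration, not new/s/leak’s actual data model):

```python
from dataclasses import dataclass

@dataclass
class DisplayedItem:
    text: str
    label: str          # entity type or a user-created tag
    source: str         # "document" = genuine, "machine" = automatically extracted
    confidence: float   # 1.0 for genuine content, the model's score otherwise

    def provenance(self):
        """Human-readable provenance string for the UI."""
        if self.source == "document":
            return "from the source documents"
        return f"machine-extracted, confidence {self.confidence:.0%}"

item = DisplayedItem("Apple", "ORG", source="machine", confidence=0.87)
print(f"{item.text} [{item.label}] ({item.provenance()})")
```

Surfacing the `confidence` field directly in the interface lets time-pressed users rely on automated filtering while precision-focused users can discount low-confidence machine output.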
new/s/leak sketch

The scribbled wireframe (with some annotation)

The productive day at SPIEGEL was concluded with some final discussions, first drafts of wireframes, and coffee (see pictures).

Our next goal is to finish a first stand-alone prototype, with a special focus on relation extraction for the network.

Science + Data Journalism = new/s/leak

On January 1st, we officially started to build our “Network of Searchable Leaks” or, in short: new/s/leak. Our goal is to put the latest research in language technology and data visualization together to help journalists keep their heads above water when facing a dataset like the famous Cablegate. The idea is to have a network of all actors (people, organizations, places) and show who does what, with whom, where, and when.

What sounds like magic is actually feasible using current research results: sceptics might want to look at the Network of the Day (in German), which will be the starting point for our new tool.

At some point, we want to arrive at something resembling this sketch from our project proposal:

Wireframe from proposal

An early wireframe for our software


The first kickoff with all project players in one room happened on January 18 (after several internal kickoffs and the meeting at Datenlabor): we were all warmly welcomed and well-caffeinated guests of our visualization colleagues from the Interactive Graphics Systems Group at TU Darmstadt. We had lots of constructive discussions about journalists’ needs, search, visual data representations, and our project name (which was the only question we had to postpone).
The most important outcome is that we are well on our way:

Four TU Darmstadt computer science students (Lukas Raymann, Patrick Mell, Bettina Johanna Ballin and Nils Christopher Böschen) already built a prototype as their software project. It shows a network of entities from the underlying documents, together with a timeline:


The first new/s/leak prototype

The screenshot offers a glimpse of something which could have helped the people who had to work double shifts to browse the 2 million records of the Cablegate leaks – if new/s/leak had been around at that time already.

The next steps will bring more search functionality, dynamic changes in the network, and more data.


We made it!

Happy news: the VW Foundation officially decided to fund our project with the working title DIVID-DJ: Data Extraction and Interactive Visualization of Unexplored Textual Datasets for Investigative Data-Driven Journalism.
We are one of eight projects funded as part of the initiative “Science and Data Journalism”. Our goal is to create a piece of software that visualizes the content of large text data collections, to help journalists working with data leaks.

VW foundation invited all project partners to a kickoff meeting at TU Dortmund, where all projects were introduced prior to the “Daten-Labor” conference of Netzwerk Recherche. The project funding will officially start in January 2016.

More details to come!