Dataharvest Conference #EIJC18

From Thursday 24 to Sunday 27 May 2018, the European Investigative Journalism Conference (EIJC 2018) will take place in Mechelen, Belgium. The new/s/leak project will participate and discuss the requirements and needs of our target user group. You can find out all about the conference on this website:

Funding extension

We are happy to announce that the new/s/leak project has received additional funding from the Volkswagen Stiftung. Until summer 2018, new/s/leak will be extended and refactored to achieve the following goals:

  • easy deployment for your own use
  • comprehensive and detailed documentation
  • improved user interface
  • improved information extraction (better keyterm extraction, named entity recognition, support of user dictionaries)
  • support for multiple languages (among others English, German, Spanish, French, Arabic, Chinese)

Follow the updates on this blog to see how far we got 🙂


new/s/leak demo @ SPIEGEL

Now that we’re in the middle of new/s/leak’s home stretch, we had a final demo at SPIEGEL in Hamburg. After some exciting and productive development sprints, we proudly introduced the software to journalists, documentarists and software developers, who gave us the best feedback by playing around with the tool and becoming absorbed in using it. Some evidence:

We also collected some more systematic feedback, which helped us prioritize the remaining tasks. Thanks to everyone who came along, played and gave feedback – we had a blast at the meeting, and we learned a lot!

If you also want to see what has changed in new/s/leak since we showed it to an academic audience at ACL, here is the link to the demo (please use the Chrome browser!)

For a quick introduction, you can also watch a video (from our academic publication @ VIP):

During the upcoming weeks until Christmas, we’ll add some more requested features, fix some bugs, and create an easy-to-deliver software package. Stay tuned for a deployable version!

new/s/leak @ VIP

Last week, new/s/leak had its academic debut in the visualization science community at the Visualization in Practice Workshop, co-located with the IEEE VIS 2016 conference.

Here is the paper documenting the software with a focus on visualization. Needless to say, it’s always fun to present new/s/leak and get more feedback:

Kathrin presenting new/s/leak

Thanks to everyone who came and visited us!


Paper accepted @ VIS 2016

Our paper “new\s\leak – A Tool for Visual Exploration of Large Text Document Collections in the Journalistic Domain” has been accepted for presentation at the poster session of the Visualization in Practice Workshop, which is part of the IEEE VIS 2016 conference. The workshop will take place in Baltimore, Maryland, USA on October 24-25.

VIS is one of the most important conferences in visualization science. new/s/leak fits perfectly in this year’s VIP workshop, the focus of which is design, development, distribution, and application of open source visualization and visual analytics software.

Meet us at the demo session in Baltimore!

new/s/leak @ ACL 2016

Last week, we presented new/s/leak for the first time in public: we had our demo session at the annual meeting of the Association for Computational Linguistics, which was held in Berlin this year.
If you haven’t had the chance to attend (or you had and want to have some references now):

  • Here’s the paper documenting our software (you’ll find a large part of the information from the paper in this blog, too)
  • Here’s our poster (in PDF format)

And, of course, we took some pictures, testifying how much fun we had (click for larger versions):

Alex walking through the new/s/leak poster (with Seid listening)

Seid explaining new/s/leak

Chris, Heiner and Seid busy discussing new/s/leak

Chris, Heiner, Seid and Alex discussing new/s/leak with different people

Seid and many curious new/s/leak fans

The crew busy explaining, demonstrating and (secretly) playing with new/s/leak

Thanks to everyone who stopped by, especially for the great suggestions for improvements! Rest assured that we’re working on them while you’re reading this post. If you have any more ideas or application visions, or simply want to debate information extraction and/or journalism – please get in touch!

The Science behind new/s/leak II: Interactive Visualization

We already explained the language-related data wrangling happening under new/s/leak’s hood. For the success of new/s/leak, our second scientific field is the game changer: interactive visualization. No matter how much accurate information we can produce – if we cannot present it to the user in an appealing way, the tool will fail to reach its goals. So how exactly is visualization science influencing new/s/leak?

Your daily dose of visualization science

It might seem easy to create some kind of visualization (with Excel or even pen and paper) – however, there are lots of pitfalls that you need to avoid to create good visualizations. You might take some of the solutions for granted because you encounter them every day when browsing the web – but they’d be painfully missed otherwise. Two examples which you can find in many applications and websites (and which we of course also consider for new/s/leak):

  • Animation speed: there is a certain animation speed that is pleasant to look at, and it cannot be much faster or slower if we want to convey information. While it’s intuitively clear that reaaallyyyy slooooowwww animations can be annoying (think of those endless progress bars…), going too fast can overload you, even if there is no critical information involved. For evidence, take a minute to look at those two guys working out (they are doing exactly the same movement).

    Now answer the question: would you pick the guy on the right as your office mate? Might become exhausting really soon (even if he’s really smart).
  • Colors: If you use colors for information visualization, you cannot just use any color set you find appealing. While there is scientific work on which color sets are good for which information types, we also need to think about color-blind people. On many web pages, the color sets are already designed to be suitable for different types of color perception. See how the new/s/leak logo would look for color-blind people (created with Coblis):

Those are just two examples of a whole lot of guidelines that visualization scientists have developed and that we encounter in each good visualization. Of course new/s/leak has to follow all of those guidelines – which becomes harder with more complex data, and at scale. Which brings us to the next important point:


Accurate views on loads of complex data

This happens with too much information displayed at once

The largest challenge (and thus the largest need) for visualization science and language technology alike comes with the huge amounts of data we have to handle. A leak can be anything from 100 documents to 1 TB (or more). This is not only a lot for research-based software, but also enough to break many commercial applications. So, this is where the action is.

Visualizing data for investigative purposes means that the software must not show anything untrue (not even anything shady or too ambiguous). However, there is so much data to display (all the documents, their metadata, and all the entities we extracted) that we simply cannot display the whole truth on one screen: that would be a) impossible and b) completely unusable. Imagine a network that shows all the information it has – that quickly becomes a “hairball” like in the picture on the left.

Usability

Because new/s/leak should be intuitively accessible for users with different backgrounds and without much training, we need to find easy interfaces for giant piles of complex data. This excludes e.g. some very powerful but rather complicated interfaces used for search in scientific environments.

One extreme way to tackle this is the way Google presents the internet to us, or rather the 60 trillion pages it has indexed: initially, Google doesn’t display anything, but rather lets the user explore (see comic on the right). While we allow users to explore the data on their own, new/s/leak’s main purpose is to guide users through the data jungle and to provide a concise graphical summary of the core plots. Consequently, we display an initial graph as an entry point that contains the most interesting entities and relations. It’s a scientific question of its own to find out what “most interesting” means, but according to our users, frequency is a good indicator here. We thus initially show the most frequent people, companies and places, and let the user explore from there. (If they wish, users can also pick the least frequent entities first – this might foster serendipity.)
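The frequency-based entry point described above can be sketched in a few lines. This is only an illustration under stated assumptions, not new/s/leak's actual implementation: the function name `select_initial_entities`, the parameter `k`, the choice to count each entity once per document, and the toy data are all hypothetical.

```python
from collections import Counter


def select_initial_entities(documents, k=3, least_frequent=False):
    """Pick k entities for the entry-point graph, ranked by document frequency.

    `documents` is a list of entity lists, one list per document.
    This ranking rule is an assumption made for illustration.
    """
    counts = Counter()
    for entities in documents:
        # Count each entity at most once per document.
        counts.update(set(entities))

    def sort_key(item):
        name, count = item
        # Most frequent first by default; ties are broken alphabetically.
        return (count if least_frequent else -count, name)

    ranked = sorted(counts.items(), key=sort_key)
    return [name for name, _ in ranked[:k]]


# Toy "leak" with four documents and their extracted entities (made up).
docs = [
    ["Alice", "Acme Corp", "Berlin"],
    ["Alice", "Acme Corp"],
    ["Alice", "Hamburg"],
    ["Bob", "Berlin"],
]

top = select_initial_entities(docs, k=3)
rare = select_initial_entities(docs, k=2, least_frequent=True)
print(top)   # ['Alice', 'Acme Corp', 'Berlin']
print(rare)  # ['Bob', 'Hamburg']
```

Setting `least_frequent=True` corresponds to the option of starting from rare entities instead of the most frequent ones.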


Knowing our users

Coming back to the usability comic – of course we have to design new/s/leak for its users. That’s what Apple and Google do, too: providing a simple one-size-fits-all screen as an entry point. While we have a smaller user group, we also have an application with more interaction possibilities. So the key to everything lies in knowing our users, in order to find the right mixture of the Google window (with too little functionality) and an overloaded brick puzzle (which is ugly and unusable).

User studies are, in fact, an important pillar of visualization science. We already told you about early requirements management. It might sound trivial that we ask people what they like and dislike, and then change the system accordingly – but this is by no means a matter of course (see e.g. some excuses why companies don’t do user research). One of the reasons for this is that it actually takes lots of experience to design a good user study, one which assesses all the information needed without influencing the user, and which allows us to generate meaningful hypotheses for interface changes. Further, it simply takes quite some time to undertake such studies repeatedly. Fortunately, our visualization scientists are experts in user studies, and they are dedicated to testing our system frequently and in a way that allows for objective, comprehensive evaluation.

Paper accepted @ SKILL 2016

Franziska, who designed and documented our requirements analysis for her research seminar paper (Studienarbeit), turned this into an article accepted for SKILL 2016, titled “new/s/leak – Anforderungsanalyse einer interaktiven Visualisierung für Data-Driven Journalism” (in German – English translation: “new/s/leak – Requirements analysis of an interactive visualization for data-driven journalism”).

Congrats Franziska, and of course thanks for the excellent analysis!

Paper accepted @ ACL 2016

Our paper “new/s/leak – Information Extraction and Visualization for Investigative Data Journalists” was accepted for the Systems Demonstration Track at ACL 2016!

ACL is the most important conference for computational linguistics, and we will present new/s/leak there. Stay tuned for the final paper version – and our publicly available prototype.

Looking forward to meeting you in Berlin!

new/s/leak’s impact on science

There are many reasons why data journalism needs new scientific approaches, and we have discussed some of them at length. So far we haven’t talked much about the reverse claim, which is, however, equally true: this journalistic project also advances science.
So why are our scientists so passionate about the new/s/leak project? And what kind of scientific challenges do we face?


/S/cience on a Mission

All our scientists agree that it’s an invaluable experience to work on real use cases, solving real-world problems and collecting extensive feedback from real users.

Science on a mission to help journalists

Learning how journalists work and researching new ways to help them would already be exciting enough on its own – but investigative journalism is way more than just an intriguing application scenario: journalistic work has essential social impact, and our software will help to increase transparency not only for journalists, but also for everyone out there who reads, watches or listens to their stories.


/S/caling up

Scaling up to big data language processing and visualization

Genuine use cases come with genuine research challenges: for both the visual and the backend part of new/s/leak, we need to turn scientific prototypes into a scalable, user-friendly, big-data-proof application.

On the backend side, we thus need techniques to speed up data processing without sacrificing quality, for which we also need lots of engineering with new frameworks and tools.

Of course we also need a way to keep the actual user interface clean and responsive, regardless of the (possibly huge) amount of data behind it. This is a core challenge tackled from the visualization side.


/S/ubstantial Interactivity

We need to integrate all kinds of user interaction

Interaction design is a challenge for new/s/leak in many ways: First of all, we have our visualization scientists devoted to the challenge of finding a smart interface that allows for intuitive user interaction. On the backend side, we need to integrate user interactions into the language processing pipeline (see our Requirements Analysis), because we want to enable users to define entities.

We also need to create possibilities for the users to interact collaboratively within the newsroom.

And, of course, we need to design our own interaction process for the interdisciplinary development of frontend and backend, and we need to translate between journalists and scientists. As with almost all projects, the things that make new/s/leak more exciting also bring more challenges.


/S/uccess Indicators

Scientists like to measure success in reproducible numbers. For example, we could rate an algorithm for Entity Recognition by counting how many entities it recognizes correctly, using a text in which all entities were marked by human experts. This is great because you can compare different systems, and you can track the progress of your own approach. From a scientist’s point of view, we’d strive for such an evaluation strategy for new/s/leak, too – but it’s not that easy.
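To make the counting idea concrete, here is a minimal sketch of such an evaluation. It assumes the common precision/recall/F1 formulation, which this post does not name explicitly, and it assumes entities are represented as `(start, end, type)` tuples that must match exactly. The function name and the toy annotations are hypothetical.

```python
def evaluate_entities(predicted, gold):
    """Compare predicted entities against expert (gold) annotations.

    Both arguments are collections of (start, end, type) tuples.
    An entity counts as correct only on an exact match (an assumption).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)

    # Precision: share of predicted entities that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = correct / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up annotations: token spans with entity types.
gold = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "LOC"), (12, 14, "PER")}
predicted = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "LOC")}

precision, recall, f1 = evaluate_entities(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three predictions match the gold annotations exactly; the span `(5, 6)` was found but labeled with the wrong type, so it counts as an error for both precision and recall.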

We need to develop user-centric evaluation methods for scientific methods

We cannot just count how often new/s/leak shows something which is relevant, because one single fact (or even sentence) of a leak is hardly ever relevant on its own. We cannot reverse-engineer this problem either: an article based on a leak has no particular list of text snippets that constitutes all the information contained in the story. Rather than counting and comparing, we will take an approach which our experts for graphical interactive systems also use regularly: we have conducted, and will continue to conduct, user studies in which we ask questions that allow us to quantify success without using exact measures. Like the whole project, the definition of success needs to be scientifically grounded, but entirely user-centric.

For new/s/leak’s scientists (and also the journalists, of course), this project will be successful if the software helps many users to discover information that matters. And all of us hope that the work on new/s/leak will be continued sustainably in follow-up projects.