The paper “A Multilingual Information Extraction Pipeline for Investigative Journalism”, which focuses on the information extraction component of new/s/leak 2.0, has been accepted at the software demonstrations track of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).
The conference paper is available in the ACL Anthology.
Abstract: We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files up to several Gigabytes containing unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks.