A Multilingual Information Extraction Pipeline for Investigative Journalism
Gregor Wiedemann, Seid Muhie Yimam, Chris Biemann

TL;DR
This paper presents a multilingual information extraction pipeline integrated into journalism software, enabling efficient processing of large, diverse, multilingual document collections for investigative reporting.
Contribution
The paper introduces a pipeline that combines multiple NLP tools for multilingual entity extraction, tailored to investigative journalism workflows and large-scale data analysis.
Findings
Supports up to 40 languages for document processing
Enables quick extraction of entities, metadata, and full text
Facilitates visual exploration of large unstructured data collections
Abstract
We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or from unofficial data leaks. Our software prepares a visually aided exploration of the collection so that journalists can quickly learn about potential stories contained in the data. It is based on the automatic extraction of entities and their co-occurrence in documents. In contrast to comparable projects, we focus on the…
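The abstract says the exploration is based on extracted entities and their co-occurrence in documents. The sketch below illustrates one simple way such co-occurrence counts could be computed. It is not the authors' implementation: the function name `cooccurrence_counts`, the input format (one list of already-extracted entity strings per document), and the placeholder entity names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count, for each unordered entity pair, the number of documents
    in which both entities appear (hypothetical helper, not from the paper)."""
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document so repeated mentions count once,
        # and sort so each pair has a single canonical ordering.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Made-up placeholder entities, one list per document.
docs = [
    ["Person A", "City B", "Organization C"],
    ["City B", "Organization C"],
    ["Person A", "City B", "City B"],
]

counts = cooccurrence_counts(docs)
for pair, n in counts.most_common():
    print(pair, n)
```

Counting at the document level, rather than per mention, keeps long documents from dominating the result. Pairs with high counts are natural candidates for edges in an entity network that a journalist could browse visually.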
