A Multilingual Information Extraction Pipeline for Investigative Journalism

Gregor Wiedemann; Seid Muhie Yimam; Chris Biemann

arXiv:1809.00221·cs.CL·September 17, 2018

TL;DR

This paper presents a multilingual information extraction pipeline that serves as the input processor for the New/s/leak 2.0 journalism software, enabling efficient processing of large, diverse, multilingual document collections for investigative reporting.

Contribution

It introduces a novel pipeline combining multiple NLP tools for multilingual entity extraction, tailored for investigative journalism workflows and large-scale data analysis.

Findings

01. Supports up to 40 languages for document processing

02. Enables quick extraction of entities, metadata, and full text

03. Facilitates visual exploration of large unstructured data collections

Abstract

We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files up to several Gigabytes containing unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks. Our software prepares a visually-aided exploration of the collection to quickly learn about potential stories contained in the data. It is based on the automatic extraction of entities and their co-occurrence in documents. In contrast to comparable projects, we focus on the…
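
The abstract describes exploration based on extracted entities and their co-occurrence in documents. The following is a minimal sketch of document-level co-occurrence counting, assuming entities have already been extracted for each document. The function name `cooccurrence_counts` and the toy data are hypothetical illustrations and are not taken from New/s/leak.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count, for each unordered entity pair, the documents containing both.

    doc_entities: iterable of lists of entity strings, one list per document.
    (Hypothetical input shape, chosen for illustration only.)
    """
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        counts.update(combinations(unique, 2))
    return counts


# Toy data standing in for the output of an entity extraction step.
docs = [
    ["Alice Example", "Acme Corp", "Berlin"],
    ["Alice Example", "Acme Corp"],
    ["Berlin", "Acme Corp"],
]
pairs = cooccurrence_counts(docs)

# Print the pairs that share the most documents first.
for (a, b), n in pairs.most_common():
    print(f"{a} <-> {b}: {n}")
```

Pair counts of this kind could serve as edge weights in an entity network for visual exploration; how the actual software weights or filters such edges is not stated in this summary.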
