Author Unknown: Evaluating Performance of Author Extraction Libraries on Global Online News Articles
Sriharsha Hatwar, Virginia Partridge, Rahul Bhargava, Fernando Bermejo

TL;DR
This paper evaluates five existing author extraction packages and one customized model on multilingual online news articles. Results vary widely across languages, so extracted author metadata needs further validation before it can be relied on.
Contribution
It introduces a manually coded cross-lingual dataset of online news article authors and uses it to compare existing extraction tools across languages.
Findings
Go-readability and Trafilatura are the most consistent tools for author extraction
All tools show high variability across languages
Further validation needed for specific languages and regions
Abstract
Analysis of large corpora of online news content requires robust validation of underlying metadata extraction methodologies. Identifying the author of a given web-based news article is one example that enables various types of research questions. While numerous solutions for off-the-shelf author extraction exist, there is little work comparing performance (especially in multilingual settings). In this paper we present a manually coded cross-lingual dataset of authors of online news articles and use it to evaluate the performance of five existing software packages and one customized model. Our evaluation shows evidence for Go-readability and Trafilatura as the most consistent solutions for author extraction, but we find all packages produce highly variable results across languages. These findings are relevant for researchers wishing to utilize author data in their analysis pipelines,…
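As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch of scoring extracted authors against manually coded labels, broken down by language. The scoring rule (exact set match after Unicode and whitespace normalization), the function names `normalize` and `accuracy_by_language`, and the sample records are all assumptions made for this example; the excerpt above does not state which metric the paper uses.

```python
from collections import defaultdict
import unicodedata


def normalize(name):
    """Normalize an author name for comparison (assumed rule, not the paper's)."""
    # NFKC folds compatibility characters; casefold handles case across scripts.
    text = unicodedata.normalize("NFKC", name)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.casefold().split())


def accuracy_by_language(records):
    """Return exact-match accuracy per language.

    records: iterable of (language, gold_authors, predicted_authors),
    where the author fields are lists of name strings.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, gold, predicted in records:
        gold_set = {normalize(n) for n in gold}
        pred_set = {normalize(n) for n in predicted}
        totals[language] += 1
        # Count a hit only when the tool returns exactly the coded authors.
        if gold_set == pred_set:
            hits[language] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    # Hypothetical records: (language, hand-coded authors, tool output).
    sample = [
        ("en", ["Jane Doe"], ["jane  doe"]),
        ("en", ["John Smith"], []),
        ("es", ["Ana Pérez", "Luis Gómez"], ["Luis Gómez", "Ana Pérez"]),
    ]
    print(accuracy_by_language(sample))
```

Reporting one score per language, rather than a single pooled score, is what makes cross-language variability visible. A stricter or looser matching rule (for example, partial credit when only some authors are found) would change the numbers, so the rule should be fixed before comparing tools.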
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Data Quality and Management
