Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
Ruggero Marino Lazzaroni, Jana Lasser, Kirill Solovev

TL;DR
Infini-News is a retrieval system covering over 1.3 billion news articles from Common Crawl. It supports fast text pattern search and adds language and geographic metadata for social science and NLP research.
Contribution
The paper introduces a scalable index and a metadata enrichment pipeline for the entire CC-News archive, making large-scale news analysis more accessible.
Findings
Enriched over 1.35 billion articles with language metadata, and resolved a country of origin for 83.4% of them across 222 countries.
Constructed suffix-array indexes enabling sub-second text pattern searches.
Lowered barriers for longitudinal and cross-national media research.
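To illustrate why a suffix array makes pattern search fast, here is a minimal sketch on a toy in-memory string. It is not the Infini-gram implementation or its API; the function names (`build_suffix_array`, `find_occurrences`) are hypothetical, and the naive construction shown here would not scale to a real corpus. The idea it demonstrates is general: once all suffixes are sorted, every occurrence of a pattern sits in one contiguous range that two binary searches can locate.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; production indexes use
    far more efficient algorithms and on-disk storage.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: List[int], pattern: str, upper: bool) -> int:
    """Binary search for the first suffix whose length-m prefix is
    >= pattern (upper=False) or > pattern (upper=True)."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start offsets of every occurrence of `pattern`."""
    if not pattern:
        return []
    start = _bound(text, sa, pattern, upper=False)
    end = _bound(text, sa, pattern, upper=True)
    return sorted(sa[start:end])


if __name__ == "__main__":
    doc = "the news cycle renews the news"
    sa = build_suffix_array(doc)
    print(find_occurrences(doc, sa, "news"))
```

Each lookup costs a number of string comparisons logarithmic in the text length, independent of how many documents were concatenated into the text, which is the property that makes this kind of index attractive for very large archives.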
Abstract
Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array…
