News Harvesting from Google News combining Web Scraping, LLM Metadata Extraction and SCImago Media Rankings enrichment: a case study of IFMIF-DONES
Victor Herrero-Solana

TL;DR
This paper presents a comprehensive methodology for constructing news datasets from Google News using web scraping, LLM metadata extraction, and media rankings, demonstrated through a case study on the IFMIF-DONES project.
Contribution
It introduces a systematic five-stage pipeline combining multiple techniques for news data collection and analysis, highlighting challenges and solutions for dataset quality.
Findings
High complementarity with proprietary news databases: 76% of Google News records are exclusive to that platform.
Captures content types absent from the proprietary databases, including specialized outlets, institutional communications, and social media posts.
Identifies significant methodological challenges, such as noise, temporal instability, and hallucinations in LLM extraction.
Abstract
This study develops and evaluates a systematic methodology for constructing news datasets from Google News, combining automated web scraping, large language model (LLM)-based metadata extraction, and SCImago Media Rankings enrichment. Using the IFMIF-DONES fusion energy project as a case study, we implemented a five-stage data collection pipeline across 81 region-language combinations, yielding 1,482 validated records after a 56% noise reduction. Results are compared against two licensed press databases: MyNews (2,280 records) and ProQuest Newsstream Collection (148 records). Overlap analysis reveals high complementarity, with 76% of Google News records exclusive to this platform. The dataset captures content types absent from proprietary databases, including specialized outlets, institutional communications, and social media posts. However, significant methodological challenges emerge:…
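The exclusivity figure in the overlap analysis can be read as the share of records in one source with no match in the others. The following is a minimal sketch of that calculation, not the paper's procedure: it assumes records are matched by a normalized URL, which the abstract does not confirm, and the function names `normalize_url` and `exclusive_share` and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal.

    Assumption: URL identity is the matching key. The paper may match
    records differently (for example by title and date).
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path


def exclusive_share(source_urls, other_urls):
    """Fraction of distinct source records absent from the other sources."""
    source_keys = {normalize_url(u) for u in source_urls}
    other_keys = {normalize_url(u) for u in other_urls}
    if not source_keys:
        return 0.0
    return len(source_keys - other_keys) / len(source_keys)


# Hypothetical sample data, for illustration only.
google_news = [
    "https://www.example.com/a/",
    "https://example.org/b",
    "https://example.net/c",
    "https://example.net/d",
]
licensed = ["http://example.com/a"]

share = exclusive_share(google_news, licensed)
print(f"{share:.0%} of records are exclusive")
```

Matching on URLs alone would miss syndicated copies of the same article published under different addresses, so a real comparison would likely need a fuzzier key.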
