MegaWika 2: A More Comprehensive Multilingual Collection of Articles and their Sources

Samuel Barham; Chandler May; Benjamin Van Durme

arXiv:2508.03828·cs.DL·August 7, 2025

TL;DR

MegaWika 2 is an extensive multilingual dataset of Wikipedia articles with detailed citation and source information, designed to facilitate fact-checking and cross-lingual analysis.

Contribution

It expands the original MegaWika dataset to six times as many articles and twice as many fully scraped citations, and is designed to support fact checking and analyses across time and language.

Findings

01 Six times as many articles as the original MegaWika

02 Twice as many fully scraped citations

03 Supports fact checking and cross-language analysis

Abstract

We introduce MegaWika 2, a large, multilingual dataset of Wikipedia articles with their citations and scraped web sources; articles are represented in a rich data structure, and scraped source texts are stored inline with precise character offsets of their citations in the article text. MegaWika 2 is a major upgrade from the original MegaWika, spanning six times as many articles and twice as many fully scraped citations. Both MegaWika and MegaWika 2 support report generation research; whereas MegaWika also focused on supporting question answering and retrieval applications, MegaWika 2 is designed to support fact checking and analyses across time and language.
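
To make the idea of citations stored inline with character offsets concrete, here is a minimal Python sketch. It is not the actual MegaWika 2 schema: the class names (`Article`, `Citation`), every field name, the helper functions, and the toy article text are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    # Hypothetical fields; the real MegaWika 2 schema may differ.
    start: int                         # offset of the first character of the cited span
    end: int                           # offset one past the last character (exclusive)
    url: str                           # address of the cited web source
    source_text: Optional[str] = None  # scraped source text, or None if not scraped


@dataclass
class Article:
    # Hypothetical container for one article and its citations.
    title: str
    language: str
    text: str
    citations: List[Citation] = field(default_factory=list)


def cited_span(article: Article, citation: Citation) -> str:
    """Return the stretch of article text that a citation is attached to."""
    return article.text[citation.start:citation.end]


def fully_scraped(article: Article) -> List[Citation]:
    """Keep only the citations whose source text was actually retrieved."""
    return [c for c in article.citations if c.source_text]


# Toy example: offsets are computed from the text rather than hard-coded.
text = "The bridge opened in 1932. It spans the harbor."
claim = "The bridge opened in 1932."
start = text.index(claim)

example = Article(
    title="Example Bridge",
    language="en",
    text=text,
    citations=[
        Citation(start, start + len(claim), "https://example.org/history",
                 source_text="Records show the bridge opened in 1932."),
        Citation(start, start + len(claim), "https://example.org/missing"),
    ],
)

print(cited_span(example, example.citations[0]))
print(len(fully_scraped(example)))
```

Keeping the offsets next to the scraped text means a fact-checking pipeline can pair each claim in the article with its supporting source without re-parsing the article markup.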

Code & Models

Datasets

jhu-clsp/megawika-2
dataset · 87 downloads
