WRAP++: Web discoveRy Amplified Pretraining

Jiang Zhou; Yunhao Wang; Xing Wu; Tinghao Yu; Feng Zhang

arXiv:2604.06829 · cs.CL · April 10, 2026

TL;DR

WRAP++ enhances large language model pretraining by discovering and synthesizing cross-document relationships from web hyperlinks, significantly amplifying training data with relational knowledge beyond single documents.

Contribution

WRAP++ introduces a method that discovers cross-document relationships from web hyperlinks and synthesizes joint QA over each discovered document pair, expanding the training data available for LLM knowledge acquisition.

Findings

01. Amplifies ~8.4B tokens to 80B tokens of cross-document QA data, roughly a 9.5x expansion.

02. Models trained with WRAP++ outperform single-document approaches.

03. Cross-document synthesis yields better performance and scaling gains.

Abstract

Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone,…
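The discovery step described in the abstract can be illustrated with a minimal sketch over a toy hyperlink graph. The abstract names the two motifs but does not define them here, so the definitions below are assumptions: a dual-link is taken to be a pair of pages that link to each other, and a co-mention is taken to be a pair of pages that are both linked from the same third page. The function names, the `links` data layout, and the `min_support` threshold (a stand-in for "high-confidence") are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def find_dual_links(links):
    """Return unordered page pairs that link to each other.

    `links` maps a page id to the set of page ids it links to.
    Assumed definition of a dual-link: A -> B and B -> A.
    """
    pairs = set()
    for src, targets in links.items():
        for dst in targets:
            # Skip self-links; keep the pair only if the reverse edge exists.
            if dst != src and src in links.get(dst, set()):
                pairs.add(tuple(sorted((src, dst))))
    return pairs


def find_co_mentions(links, min_support=2):
    """Return unordered page pairs linked together from at least
    `min_support` distinct pages.

    Assumed definition of a co-mention: some page links to both A and B.
    `min_support` is a hypothetical confidence threshold.
    """
    counts = defaultdict(int)
    for src, targets in links.items():
        # Sort so each unordered pair has one canonical tuple form.
        others = sorted(t for t in targets if t != src)
        for a, b in combinations(others, 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_support}


# Toy hyperlink graph (illustrative data only).
links = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": set(),
    "D": {"B", "C"},
}

dual_pairs = find_dual_links(links)
co_mention_pairs = find_co_mentions(links, min_support=2)
```

Each discovered pair would then be passed to a QA synthesis step that asks questions requiring both documents. That step depends on a generator model and prompts the abstract does not specify, so it is left out of the sketch.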
