CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham, Yapei Chang, Mohit Iyyer

TL;DR
CLIPPER introduces a compression-based method for generating high-quality synthetic long-context data, significantly improving narrative claim verification accuracy and grounding in large language models.
Contribution
The paper presents a novel compression-based approach for synthetic data generation that enhances claim validity and reasoning complexity in long-context tasks.
Findings
Reached 76% accuracy on narrative claim verification with a model fine-tuned on the synthetic data.
Constructed a 19K synthetic dataset with claims, source texts, and reasoning.
Set new state-of-the-art for sub-10B models on the NoCha leaderboard.
Abstract
LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw text of the book, which results in artifact-riddled claims, CLIPPER first compresses the book into chapter outlines and book summaries and then uses these intermediate representations to generate complex claims and corresponding chain-of-thoughts. Compared to naive approaches, CLIPPER produces claims that are more valid, grounded, and complex. Using CLIPPER, we construct a dataset of 19K synthetic book claims paired with their source texts and chain-of-thought reasoning, and use it to fine-tune…
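The two-stage flow described in the abstract (compress first, then generate claims from the compressed form) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the prompt wording, and the names `ClaimRecord`, `compress_book`, `generate_claim`, and `build_record` are all hypothetical. Only the overall structure (chapter outlines and a book summary feeding claim and chain-of-thought generation, with the source text kept alongside) comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type: any function mapping a prompt string to generated text.
LLM = Callable[[str], str]


@dataclass
class ClaimRecord:
    """One synthetic example: a claim, its reasoning, and the source book text."""
    claim: str
    chain_of_thought: str
    source_text: str


def compress_book(chapters: List[str], llm: LLM) -> Tuple[List[str], str]:
    """Stage 1: compress the book into chapter outlines and a book summary."""
    outlines = [llm(f"Outline the key events of this chapter:\n{ch}") for ch in chapters]
    summary = llm("Summarize the book from these chapter outlines:\n" + "\n\n".join(outlines))
    return outlines, summary


def generate_claim(outlines: List[str], summary: str, source_text: str, llm: LLM) -> ClaimRecord:
    """Stage 2: generate a claim and its chain of thought from the compressed
    representations only; the raw book text is never placed in a prompt here."""
    context = summary + "\n\n" + "\n\n".join(outlines)
    claim = llm(f"Write one claim that requires combining events from several chapters:\n{context}")
    cot = llm(f"Explain step by step whether this claim holds.\nClaim: {claim}\nContext:\n{context}")
    return ClaimRecord(claim=claim, chain_of_thought=cot, source_text=source_text)


def build_record(chapters: List[str], llm: LLM) -> ClaimRecord:
    """Run both stages and pair the result with the full source text."""
    outlines, summary = compress_book(chapters, llm)
    return generate_claim(outlines, summary, "\n\n".join(chapters), llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; in practice `llm` would wrap whatever model endpoint is available.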
Taxonomy
Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
