LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

Zikai Xiao; Fei Huang; Jianhong Tu; Jianhui Wei; Wen Ma; Yuxuan Zhou; Jian Wu; Bowen Yu; Zuozhu Liu; Junyang Lin

arXiv:2510.24345·cs.CL·October 29, 2025

TL;DR

LongWeave introduces a benchmark for long-form generation that balances real-world relevance with verifiability, enabling rigorous assessment of LLMs' capabilities in complex, realistic scenarios with customizable lengths.

Contribution

The paper presents LongWeave, a benchmark for realistic and verifiable long-form generation, together with Constraint-Verifier Evaluation (CoV-Eval), the method used to construct and assess its tasks.

Findings

01 State-of-the-art models struggle with complex, long outputs.

02 Evaluation reveals significant challenges in meeting real-world constraints.

03 LongWeave supports customizable input/output lengths of up to 64K/8K tokens across diverse scenarios.

Abstract

Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens)…
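
The targets-first construction described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Constraint`, `Task`, `build_task`, and `score` are hypothetical, and the substring check stands in for whatever verifiers CoV-Eval actually uses. It only shows the ordering the abstract states: fix checkable targets first, then derive the query, material, and constraints from them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A requirement on the output paired with a programmatic check (hypothetical)."""
    description: str
    verify: Callable[[str], bool]


@dataclass
class Task:
    """A generated task: what the model is asked, what it reads, what is checked."""
    query: str
    material: str
    constraints: List[Constraint]


def build_task(targets: List[str]) -> Task:
    """Start from verifiable targets, then derive material, query, and constraints."""
    # Material embeds every target so the task is answerable from the input.
    material = " ".join(f"Note: {t}." for t in targets)
    query = "Write a report that covers every note in the material."
    # One constraint per target; the default argument binds t at definition time.
    constraints = [
        Constraint(
            description=f"mentions '{t}'",
            verify=lambda out, t=t: t.lower() in out.lower(),
        )
        for t in targets
    ]
    return Task(query=query, material=material, constraints=constraints)


def score(task: Task, output: str) -> float:
    """Fraction of constraints the output satisfies (0.0 if there are none)."""
    if not task.constraints:
        return 0.0
    passed = sum(1 for c in task.constraints if c.verify(output))
    return passed / len(task.constraints)


if __name__ == "__main__":
    task = build_task(["alpha", "beta"])
    print(score(task, "This report discusses Alpha only."))
    print(score(task, "This report discusses alpha and BETA."))
```

Because every constraint is derived from a target fixed in advance, the score is objective by construction, which is the property the abstract attributes to this approach.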

Code & Models

Datasets

zikaixiao1/LongWeave
dataset· 6 dl
