RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis

Andrew Zhuoer Feng; Cunxiang Wang; Yu Luo; Bosi Wen; Yidong Wang; Lin Fan; Yilin Zhou; Zikang Wang; Wenbo Yu; Lindong Wu; Hongning Wang; Minlie Huang

arXiv:2603.00686 · cs.CL · March 3, 2026

Open Access

TL;DR

RAVEL is an agentic framework for evaluating the individual operations behind complex LLM text synthesis, such as outlining, drafting, reviewing, and refining. Its results suggest that reasoning ability affects synthesis quality more than raw generative capacity does.

Contribution

The paper presents RAVEL, a framework in which the LLM under test autonomously plans and executes synthesis operations, and introduces C3EBench, a benchmark of 1,258 samples derived from professional human writing that covers four tasks: Cloze, Edit, Expand, and End-to-End.

Findings

01. Most LLMs struggle with context-understanding tasks under limited instructions.

02. Agentic synthesis performance is driven more by reasoning than generative capacity.

03. A strong reasoner can improve weaker generators' output quality.

Abstract

Large Language Models have evolved from single-round generators into long-horizon agents capable of handling complex text synthesis scenarios. However, current evaluation frameworks cannot assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writing. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most…

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques