DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities

Tianyi Zhuang; Chuqiao Kuang; Xiaoguang Li; Yihua Teng; Jihao Wu; Yasheng Wang; Lifeng Shang

arXiv:2502.17807·cs.AI·February 26, 2025

TL;DR

DocPuzzle is a new benchmark designed to evaluate long-context reasoning in large language models, using expert-level questions and a human-AI validation process to ensure quality and challenge models' multi-step reasoning skills.

Contribution

The paper introduces a rigorous, process-aware benchmark built on real-world documents, together with a novel evaluation framework that reduces guessing bias when assessing the reasoning capacities of LLMs.

Findings

01 Advanced reasoning models outperform general instruct models.

02 Distilled models lag behind teacher models in reasoning ability.

03 The benchmark sets new standards for evaluating long-context reasoning in LLMs.

Abstract

We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). The benchmark comprises 100 expert-level QA problems that require multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models like o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models like Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind the teacher model, suggesting…
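
The abstract does not spell out how checklist-guided process analysis is scored, so the following is only a minimal sketch of one plausible reading: a response is credited when its final answer is right and every required reasoning step on a checklist is present, so that a lucky guess earns nothing. The `Judgement` record, the all-steps-required rule, and the toy data are assumptions made for illustration, not the authors' method or data.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical record of one graded response (not from the paper)."""
    answer_correct: bool          # final answer matches the reference
    checklist_hits: list[bool]    # one flag per required reasoning step


def process_aware_correct(j: Judgement) -> bool:
    # Assumption: credit requires a correct final answer AND every
    # checklist step being present in the model's reasoning.
    return j.answer_correct and all(j.checklist_hits)


def accuracy(judgements: list[Judgement]) -> float:
    """Fraction of responses credited under the process-aware rule."""
    if not judgements:
        return 0.0
    return sum(process_aware_correct(j) for j in judgements) / len(judgements)


# Toy data, invented purely for illustration.
graded = [
    Judgement(True, [True, True, True]),    # correct answer, full reasoning
    Judgement(True, [True, False, True]),   # correct answer, missing step
    Judgement(False, [True, True, True]),   # all steps present, wrong answer
    Judgement(False, [False, False]),       # wrong throughout
]

answer_only = sum(j.answer_correct for j in graded) / len(graded)
print(f"answer-only accuracy:   {answer_only:.2f}")
print(f"process-aware accuracy: {accuracy(graded):.2f}")
```

On this toy data the second response counts under answer-only scoring but not under the process-aware rule, which is the kind of gap a guessing-bias correction is meant to expose.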

Taxonomy

Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · AI-based Problem Solving and Planning