StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models

Yongrui Chen; Yangyang Ma; Xiaoying Huang; Shenyu Zhang; Huajun Chen; Haofen Wang; Guilin Qi

arXiv:2605.01939 · cs.CL · May 5, 2026

TL;DR

StressEval is a framework that creates dynamic, challenging, and controllable test instances for large language models by transforming observed failures into new benchmark data.

Contribution

StressEval introduces a failure-driven data synthesis method that extends knowledge-intensive reasoning benchmarks with grounded instances of controllable difficulty.

Findings

01. Dynamic OneEval causes larger performance drops than static benchmarks.

02. StressEval retains explicit difficulty factors for actionable model improvement.

03. The framework effectively targets knowledge gaps and reasoning breakdowns.

Abstract

Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In this paper, we propose StressEval, a failure-driven data synthesis framework that turns observed model failures into dynamic, challenging, and controllable test instances. StressEval consists of three stages: first, it constructs a semi-structured difficulty card that identifies the failed reasoning step and its root cause; second, it applies a dual-perspective instance synthesis method that targets both knowledge gaps and reasoning breakdowns while preserving the underlying difficulty factors; and third, it applies a gating mechanism to retain only grounded, unambiguous instances. Seeding from multiple knowledge-intensive…
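The three stages named in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `DifficultyCard`, `Candidate`, `synthesize`, `gate`, and `toy_generate` are hypothetical, the card fields are guessed from the abstract's wording, and the generator is a stub standing in for whatever model the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DifficultyCard:
    """Stage 1 (assumed shape): record where and why a model failed."""
    seed_question: str
    failed_step: str   # the reasoning step that went wrong
    root_cause: str    # e.g. "knowledge_gap" or "reasoning_breakdown"


@dataclass
class Candidate:
    """A synthesized test instance that keeps a link to its difficulty card."""
    question: str
    answer: str
    perspective: str   # "knowledge" or "reasoning"
    card: DifficultyCard
    grounded: bool     # answer is supported by some source (assumed check)
    unambiguous: bool  # exactly one acceptable answer (assumed check)


Generator = Callable[[DifficultyCard, str], Candidate]


def synthesize(card: DifficultyCard, generate: Generator) -> List[Candidate]:
    """Stage 2: produce one candidate per perspective from the same card."""
    return [generate(card, p) for p in ("knowledge", "reasoning")]


def gate(candidates: List[Candidate]) -> List[Candidate]:
    """Stage 3: keep only grounded, unambiguous candidates."""
    return [c for c in candidates if c.grounded and c.unambiguous]


def toy_generate(card: DifficultyCard, perspective: str) -> Candidate:
    """Hypothetical stub generator; a real system would call a model here."""
    return Candidate(
        question=f"[{perspective}] variant of: {card.seed_question}",
        answer="placeholder",
        perspective=perspective,
        card=card,
        grounded=True,
        # Pretend the reasoning variant came out ambiguous so the gate drops it.
        unambiguous=(perspective == "knowledge"),
    )
```

In this sketch every surviving instance carries its `DifficultyCard`, which is one simple way to keep the difficulty factors explicit after filtering; how the paper itself stores or checks these properties is not stated in the text above.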
