TL;DR
REST introduces a stress-testing framework for large reasoning models by evaluating their performance on multiple problems simultaneously, revealing weaknesses not apparent in traditional single-question benchmarks.
Contribution
This work presents REST, a novel multi-problem stress-testing framework that better assesses reasoning models' robustness and capabilities under realistic, multi-context conditions.
Findings
State-of-the-art models degrade significantly under REST stress tests.
REST outperforms existing benchmarks in discriminative power.
Models trained with 'long2short' maintain better performance under stress.
Abstract
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference…
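The abstract excerpt does not give REST's actual prompt format, so the following is only a minimal sketch of the general idea of placing several problems in one prompt. The function name `build_stress_prompts`, the `per_prompt` parameter, the `Q1:` labels, and the instruction wording are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def build_stress_prompts(questions: List[str], per_prompt: int) -> List[str]:
    """Group questions into prompts that each hold up to `per_prompt` problems.

    Hypothetical helper: the layout and wording are illustrative assumptions,
    not the format used by REST.
    """
    if per_prompt < 1:
        raise ValueError("per_prompt must be at least 1")

    prompts = []
    for start in range(0, len(questions), per_prompt):
        group = questions[start:start + per_prompt]
        # Number the problems within each prompt so answers can be matched back.
        numbered = "\n".join(f"Q{i}: {q}" for i, q in enumerate(group, start=1))
        prompts.append(
            "Solve every problem below and label each answer "
            "with its question number.\n" + numbered
        )
    return prompts


if __name__ == "__main__":
    for prompt in build_stress_prompts(["1+1?", "2+2?", "3+3?"], per_prompt=2):
        print(prompt)
        print("---")
```

With `per_prompt=1` this reduces to the usual single-question setting, so the same question pool can be scored at several stress levels and the results compared.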
