REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan; Qizhi Pei; Yu Li; Qiyao Sun; Zinan Tang; H. Vicky Zhao; Conghui He; Lijun Wu

arXiv:2507.10541·cs.CL·July 16, 2025

TL;DR

REST introduces a stress-testing framework for large reasoning models by evaluating their performance on multiple problems simultaneously, revealing weaknesses not apparent in traditional single-question benchmarks.

Contribution

This work presents REST, a novel multi-problem stress-testing framework that better assesses reasoning models' robustness and capabilities under realistic, multi-context conditions.

Findings

01. State-of-the-art models degrade significantly under REST stress tests.

02. REST outperforms existing benchmarks in discriminative power.

03. Models trained with 'long2short' maintain better performance under stress.

Abstract

Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference…
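As a rough illustration of what "multiple problems simultaneously" could look like in practice, the sketch below groups a benchmark pool into batches and joins each batch into a single prompt. This is not the authors' implementation from opendatalab/REST: the function names (`group_problems`, `build_multi_problem_prompt`), the `stress_level` parameter, the consecutive grouping scheme, and the instruction wording are all hypothetical choices made for this example.

```python
from typing import List


def group_problems(pool: List[str], stress_level: int) -> List[List[str]]:
    """Split a pool of problems into consecutive groups of `stress_level` items.

    `stress_level` is a hypothetical name for "how many problems per prompt";
    the last group may be smaller if the pool does not divide evenly.
    """
    if stress_level < 1:
        raise ValueError("stress_level must be at least 1")
    return [pool[i:i + stress_level] for i in range(0, len(pool), stress_level)]


def build_multi_problem_prompt(problems: List[str]) -> str:
    """Join several independent problems into one numbered prompt.

    The instruction text is an assumption made for this sketch, not the
    prompt used in the paper.
    """
    if not problems:
        raise ValueError("need at least one problem")
    numbered = [f"Q{i}: {text.strip()}" for i, text in enumerate(problems, start=1)]
    instruction = (
        f"Solve all {len(problems)} problems below. "
        "Give a separate final answer for each, labeled by its number."
    )
    return instruction + "\n\n" + "\n\n".join(numbered)


if __name__ == "__main__":
    pool = ["What is 2 + 3?", "Factor x^2 - 1.", "What is 7 * 6?"]
    for group in group_problems(pool, stress_level=2):
        print(build_multi_problem_prompt(group))
        print("-" * 40)
```

With `stress_level=1` this reduces to ordinary single-question evaluation, so the same harness can compare a model's accuracy with and without the multi-problem load.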

Code & Models

Repositories

opendatalab/REST (official)
