TL;DR
Interactive Benchmarks is an evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction, addressing the saturation and contamination of fixed benchmarks and the subjectivity of preference-based evaluation.
Contribution
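The paper proposes a unified, interactive evaluation framework for reasoning models with two settings: Interactive Proofs, where a model interacts with a judge under objective feedback, and Interactive Games, where a model reasons strategically to maximize long-horizon utility.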
Findings
Models show significant room for improvement in interactive reasoning tasks.
Interactive benchmarks offer a more robust assessment of reasoning abilities.
Evaluation in interactive settings reveals different strengths and weaknesses compared to static tests.
Abstract
Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing…
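To make "budgeted multi-turn interaction" concrete, here is a minimal toy sketch. It is not the paper's protocol or code: the `ThresholdJudge` class, the `run_episode` function, and the hidden-integer task are all hypothetical, chosen only to show a solver that must decide which queries to spend a limited budget on before committing to an answer that a judge can check objectively.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    solved: bool      # did the final answer pass the judge's objective check?
    turns_used: int   # how many queries were spent from the budget


class ThresholdJudge:
    """Hypothetical judge: holds a secret integer and answers queries truthfully."""

    def __init__(self, secret: int) -> None:
        self._secret = secret

    def answer(self, x: int) -> bool:
        # Objective feedback for the query "is the secret <= x?"
        return self._secret <= x

    def verify(self, guess: int) -> bool:
        # Final, objective check of the committed answer.
        return guess == self._secret


def run_episode(judge: ThresholdJudge, lo: int, hi: int, budget: int) -> Episode:
    """Stand-in 'model' policy: binary search, stopped when the budget runs out."""
    turns = 0
    while lo < hi and turns < budget:
        mid = (lo + hi) // 2
        if judge.answer(mid):
            hi = mid          # secret lies in [lo, mid]
        else:
            lo = mid + 1      # secret lies in [mid + 1, hi]
        turns += 1
    # Commit to the best remaining candidate, whether or not the range collapsed.
    return Episode(solved=judge.verify(lo), turns_used=turns)


if __name__ == "__main__":
    print(run_episode(ThresholdJudge(37), 0, 63, budget=6))
    print(run_episode(ThresholdJudge(37), 0, 63, budget=3))
```

In this toy, a budget of 6 queries is enough to isolate one of 64 candidates, while a budget of 3 leaves the range unresolved and the committed answer fails verification. A real evaluation would replace the binary-search policy with a language model choosing its own queries.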
