Interactive Benchmarks

Baoqing Yue; Zihan Zhu; Yutong Han; Brian Fan; Qian Sun; Jichen Feng; Hufei Yang; Yifan Zhang; Mengdi Wang

arXiv:2603.04737 · cs.AI · May 19, 2026

TL;DR

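Interactive Benchmarks is an evaluation paradigm that assesses a model's reasoning through budgeted multi-turn interaction. It targets two weaknesses of existing methods: fixed benchmarks that are increasingly saturated and vulnerable to contamination, and preference-based evaluations that rely on subjective judgments.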

Contribution

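The paper proposes a unified interactive evaluation framework for reasoning models, covering two settings: Interactive Proofs, where a model works with a judge under objective feedback, and Interactive Games, where a model reasons strategically to maximize long-horizon utilities.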

Findings

01. Models show significant room for improvement in interactive reasoning tasks.

02. Interactive benchmarks offer a more robust assessment of reasoning abilities.

03. Evaluation in interactive settings reveals different strengths and weaknesses compared to static tests.

Abstract

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing…
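
To make "budgeted multi-turn interaction" with objective feedback concrete, here is a minimal toy sketch. It is not the paper's framework or the code in its repository: the `Judge` class, the `run_episode` function, the number-guessing task, and the budget values are all hypothetical, and a plain binary search stands in for the model under evaluation. The judge holds a hidden answer and returns objective feedback on each query, and the episode ends when the answer is found or the turn budget runs out.

```python
from dataclasses import dataclass


@dataclass
class Judge:
    """Hypothetical judge: holds a hidden integer and gives objective feedback."""
    secret: int

    def feedback(self, guess: int) -> str:
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "correct"


def run_episode(judge: Judge, low: int, high: int, budget: int) -> dict:
    """Run one episode with at most `budget` queries to the judge.

    The 'model' is a stand-in policy (binary search) that decides which
    information to acquire next based on the feedback received so far.
    """
    turns = 0
    while turns < budget and low <= high:
        guess = (low + high) // 2      # choose the next query
        turns += 1                     # every query consumes budget
        reply = judge.feedback(guess)  # objective, verifiable feedback
        if reply == "correct":
            return {"solved": True, "turns": turns}
        if reply == "higher":
            low = guess + 1
        else:
            high = guess - 1
    return {"solved": False, "turns": turns}


if __name__ == "__main__":
    print(run_episode(Judge(secret=37), low=1, high=100, budget=7))
    print(run_episode(Judge(secret=37), low=1, high=100, budget=2))
```

In this toy setup, the same policy succeeds or fails depending only on the budget, which is the kind of signal a budgeted protocol can expose and a single-shot static question cannot.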

Code & Models

Repositories

interactivebench/interactivebench (GitHub)