ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini

TL;DR
ProactiveBench is a new benchmark for evaluating proactiveness in multimodal large language models, revealing current models' lack of proactive behavior and demonstrating that proactiveness can be learned through reinforcement learning.
Contribution
We introduce ProactiveBench, a comprehensive benchmark for proactiveness in MLLMs, and show that proactiveness can be learned via reinforcement learning, improving model behavior in various tasks.
Findings
Most MLLMs lack proactiveness.
Proactiveness does not correlate with model size.
Reinforcement learning enhances proactiveness.
Abstract
Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The concept of "proactiveness" as a distinct capability is novel and well-motivated for interactive AI systems. The formalization as seeking additional information rather than guessing or refusing is an important research direction.
- The benchmark covers diverse visual challenges across seven distinct scenarios. Comprehensive evaluation across 21 models including both open and closed-source systems.
- The paper is generally well-written with clear motivation and problem definition.
- Aggressive and insufficiently justified filtering methodology: The 25% threshold removes 58% of samples (from 17,909 to 7,557), which is extremely aggressive. The paper states this focuses evaluation on "proactive behaviors" but doesn't justify why 25% was chosen over 15%, 30%, or other thresholds. This filtering may create an unrepresentatively difficult benchmark that doesn't reflect real-world distributions where some ambiguous queries are answerable. The unfiltered results (Appendix, Table
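The sample counts quoted in this review can be checked with a few lines of arithmetic. The sketch below uses only the two figures given in the review (17,909 samples before filtering, 7,557 after); the variable names are illustrative and not taken from the paper.

```python
# Sample counts as quoted in the review above.
total_samples = 17_909
kept_samples = 7_557

# Number and share of samples removed by the filtering step.
removed_samples = total_samples - kept_samples
removed_fraction = removed_samples / total_samples

print(f"removed {removed_samples} of {total_samples} samples ({removed_fraction:.1%})")
# prints: removed 10352 of 17909 samples (57.8%)
```

The result, 57.8%, rounds to the 58% the reviewer cites.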
- Quality: The experimental methodology is sound, with comprehensive evaluation across 21 models and seven scenarios.
- Clarity: Good presentation with clear visualizations of benchmark scenarios.
- Significance: The benchmark provides a concrete framework for evaluating an important capability, proactiveness, from existing datasets.
1. Literature gap:
   - The paper fails to adequately mention existing work on evaluating and enhancing active perception and interactive VQA. Similar concepts have been explored under different terminology [1][2][3][4].
   - If the authors believe that the proactiveness they address is entirely different from existing concepts such as active vision or active perception, please justify this with sufficient literature support from both computer science and cognitive science.
2. Benchmark accessibility concern: Whil
- Clearly defines and systematically evaluates proactiveness, the ability of MLLMs to actively request additional information when the current visual context is ambiguous or incomplete.
- Covers seven diverse and realistic interaction scenarios, including occlusion, viewpoint change, low image quality, incomplete sketches, temporal ambiguity, and camera movement.
- Conducts a large-scale evaluation across 21 MLLMs, reporting both accuracy (acc) and average proactive suggestions (ps), together wi
- **Evaluation format limitation** The main experiments are conducted in a multiple-choice setting with predefined options, which can be relatively constrained and semantically suggestive. This limits the diversity of potential actions and may inflate apparent proactiveness. A richer pool of distractors would better stress-test proactive behavior.
- **Lack of statistical significance reporting** Figures and tables visualize average accuracy and proactive-suggestion rates across models/interven
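One common way to attach uncertainty to an average accuracy is a percentile bootstrap over per-sample outcomes. The sketch below is an illustration of that general technique, not the paper's or the reviewer's procedure: the `bootstrap_ci` helper and the `correct` list are hypothetical, and the outcomes are made-up values rather than benchmark results.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n outcomes with replacement and record their mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sample outcomes (1 = correct, 0 = incorrect); NOT data from the paper.
correct = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0]

acc = sum(correct) / len(correct)
lo, hi = bootstrap_ci(correct)
print(f"acc = {acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The same helper could be applied to per-sample proactive-suggestion counts to obtain an interval for the ps rate, assuming such per-sample counts are available.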
The key idea of measuring "proactiveness" is quite interesting, since it is generally assumed that the MLLM is passively receiving the input with no control over it. The paradigm where the MLLM can change its own input is quite novel, in my opinion.
1. The ps rate of many models, especially frontier models such as GPT-4.1 and o4-mini, is unreasonably low. The hint and prompt seem inadequate for eliciting proactiveness, and the results seem misleading. I would suggest the following: (a) Explicitly tell the model (in the system prompt, if available) that it can control the input; for example, that it can move the camera left or right. (b) Also, give the model a tool that can do this, instead of simply including it as an option in a multiple-
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
