Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis

TL;DR
This paper introduces PROBE, a benchmark for evaluating proactive problem solving in LLM agents across multiple capabilities, revealing current models' limitations and guiding future improvements.
Contribution
We propose PROBE, a comprehensive benchmark decomposing proactivity into search, identification, and resolution, to evaluate and compare LLM agents' autonomous problem-solving abilities.
Findings
Even the best state-of-the-art models reach only about 40% end-to-end performance on PROBE.
Current models struggle with multi-source reasoning and long-term planning.
Analysis of failure modes suggests directions for future research.
Abstract
LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and address them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity into a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Using consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is…
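To make the three-stage decomposition concrete, here is a minimal Python sketch of how per-stage and end-to-end success rates could be tallied. It is an illustration only, not the paper's scoring code: the `StageOutcome` fields, the `summarize` helper, and the rule that a task counts as an end-to-end success only when all three stages succeed are assumptions, since this summary does not describe the actual PROBE metrics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StageOutcome:
    """Hypothetical per-task record; field names are illustrative, not from PROBE."""
    searched: bool    # stage 1: the agent surfaced the relevant, unspecified issue
    identified: bool  # stage 2: the agent pinned down the specific bottleneck
    resolved: bool    # stage 3: the agent executed an appropriate resolution


def summarize(outcomes: List[StageOutcome]) -> Dict[str, float]:
    """Return the success rate per stage plus an end-to-end rate.

    Assumption: end-to-end success requires all three stages to succeed.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    n = len(outcomes)
    return {
        "search": sum(o.searched for o in outcomes) / n,
        "identification": sum(o.identified for o in outcomes) / n,
        "resolution": sum(o.resolved for o in outcomes) / n,
        "end_to_end": sum(
            o.searched and o.identified and o.resolved for o in outcomes
        ) / n,
    }


# Made-up outcomes for five tasks, purely for illustration.
outcomes = [
    StageOutcome(True, True, True),
    StageOutcome(True, True, False),
    StageOutcome(True, False, False),
    StageOutcome(False, False, False),
    StageOutcome(True, True, True),
]
report = summarize(outcomes)
```

Under this assumed rule, the end-to-end rate can never exceed the weakest stage's rate, which is one reason a pipeline-style benchmark can be hard even for agents that perform well on individual stages.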
