ProactBench: Beyond What The User Asked For
Sepehr Harfi, Ahmad Salimi, Dongming Shen, Alex Smola

TL;DR
ProactBench is a benchmark for evaluating conversational proactivity in language models: their ability to infer and act on needs the user has implied but not stated, decomposed into three phase-tied types.
Contribution
It operationalizes the benchmark with curated dialogues and a three-agent setup (a Planner, a User Agent, and an Assistant Model) to measure and analyze the different types of conversational proactivity in language models.
Findings
The Recovery phase is challenging, and standard benchmarks predict Recovery performance poorly.
The ProactBench corpus includes 198 dialogues with 624 trigger points.
Abstract
Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across multiple anchors; and Recovery, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an…
