TL;DR
HiL-Bench is a benchmark that evaluates whether agents recognize when to ask for help. It reveals a significant judgment gap in current models and shows that this skill is trainable.
Contribution
The paper presents HiL-Bench, a new benchmark whose core metric, Ask-F1, scores help-seeking behavior. It also shows that this judgment is trainable: training improves help-asking behavior in large models.
Findings
No frontier model recovers more than half of its full-information performance when it must decide whether to ask for help.
Models exhibit three help-seeking failure patterns: overconfidence, high uncertainty with errors, and broad escalation.
RL training improves help-seeking quality, and the improvement transfers across domains.
Abstract
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the…
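The abstract defines Ask-F1 only as the harmonic mean of question precision and blocker recall. A minimal sketch of that arithmetic follows. The function names, the counting-based definitions of precision and recall, and the zero-division conventions are assumptions made for illustration, not the paper's implementation.

```python
def ask_f1(question_precision: float, blocker_recall: float) -> float:
    """Harmonic mean of question precision and blocker recall."""
    total = question_precision + blocker_recall
    if total == 0:
        # Assumed convention: an agent with zero precision and zero recall scores 0.
        return 0.0
    return 2 * question_precision * blocker_recall / total


def score_episode(
    relevant_questions: int,
    total_questions: int,
    blockers_surfaced: int,
    total_blockers: int,
) -> float:
    """Hypothetical helper that computes Ask-F1 from raw counts.

    Assumed definitions (not confirmed by the source):
      precision = questions that target a real blocker / questions asked
      recall    = blockers surfaced by a question / blockers in the task
    """
    # Assumed convention: an empty denominator yields 0 rather than an error.
    precision = relevant_questions / total_questions if total_questions else 0.0
    recall = blockers_surfaced / total_blockers if total_blockers else 0.0
    return ask_f1(precision, recall)


if __name__ == "__main__":
    # An agent asks 4 questions, 2 of which hit real blockers,
    # and surfaces 2 of the task's 3 blockers.
    print(round(score_episode(2, 4, 2, 3), 4))
```

Under these assumptions, an agent that never asks scores 0 because its recall is 0, and an agent that asks about everything is penalized through low precision. This is the trade-off a harmonic mean is meant to capture.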
