HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

Tu Trinh; Mohamed Elfeki; Guangze Luo; Kelvin Luu; Nathan Hunt; Ernesto Hernandez; Nandan Marwaha; Yannis Yiming He; Charles Wang; Fernando Carabedo; Alessa Castillo; Bing Liu

arXiv:2604.09408·cs.AI·May 6, 2026

TL;DR

HiL-Bench is a benchmark that evaluates whether agents recognize when to ask for help. It reveals a significant judgment gap in current models and shows that this skill is trainable.

Contribution

The paper presents HiL-Bench, a new benchmark whose core metric, Ask-F1, scores help-seeking, and shows that this judgment is trainable: training improves help-asking behavior in large models.

Findings

01

No frontier model recovers more than half of its full-information performance when it must decide for itself whether to ask for help.

02

Models exhibit three help-seeking failure patterns: overconfidence, high uncertainty that still ends in errors, and broad escalation.

03

RL training improves help-seeking quality, and the improvement transfers across domains.

Abstract

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the…
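The abstract describes Ask-F1 as the harmonic mean of question precision and blocker recall. A minimal sketch of that arithmetic follows. The function name, its parameters, and the way precision and recall are counted here (relevant questions over questions asked, blockers surfaced over blockers present) are assumptions made for illustration; the excerpt above does not specify how the benchmark computes them.

```python
def ask_f1(relevant_questions: int, total_questions: int,
           blockers_surfaced: int, total_blockers: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Hypothetical counting scheme (not taken from the paper):
      precision = relevant_questions / total_questions
      recall    = blockers_surfaced / total_blockers
    """
    # An agent that asks nothing has no precision to speak of; treat it as 0.
    precision = relevant_questions / total_questions if total_questions else 0.0
    recall = blockers_surfaced / total_blockers if total_blockers else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 4 questions asked, 2 of them relevant; 2 of 3 blockers surfaced.
score = ask_f1(relevant_questions=2, total_questions=4,
               blockers_surfaced=2, total_blockers=3)
print(round(score, 4))  # 0.5714
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an agent cannot score well under this sketch by asking about everything (low precision) or by asking almost nothing (low recall).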

Code & Models

Repositories

hilbenchauthors/hil-bench
github