CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
Johannes Kirmayr, Lukas Stappen, Elisabeth André

TL;DR
CAR-bench is a new benchmark designed to evaluate the reliability, consistency, and limit-awareness of LLM agents in real-world, uncertain scenarios like in-car voice assistants, highlighting significant gaps in current capabilities.
Contribution
We introduce CAR-bench, a comprehensive benchmark for assessing LLM agents' handling of uncertainty, consistency, and policy adherence in multi-turn, tool-using dialogue within an in-car domain.
Findings
Large gaps in success rates between occasional and consistent performance.
Frontier LLMs achieve less than 50% success on disambiguation tasks.
Frequent policy violations and hallucinations in challenging scenarios.
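The first finding contrasts occasional with consistent success. One way to make that distinction concrete, offered here as a minimal sketch and not necessarily the metric the paper uses, is to run each task several times and compare the fraction of tasks solved in at least one trial with the fraction solved in every trial. The function name `consistency_gap`, the task labels, and the sample outcomes below are all hypothetical.

```python
from typing import Dict, List


def consistency_gap(trials: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compare 'solved at least once' with 'solved every time' across tasks.

    `trials` maps a task id to the pass/fail outcome of each repeated run.
    """
    if not trials:
        raise ValueError("trials must contain at least one task")
    if any(len(runs) == 0 for runs in trials.values()):
        raise ValueError("every task needs at least one trial")

    n_tasks = len(trials)
    # Occasional success: the agent solved the task in at least one run.
    any_rate = sum(any(runs) for runs in trials.values()) / n_tasks
    # Consistent success: the agent solved the task in every run.
    all_rate = sum(all(runs) for runs in trials.values()) / n_tasks
    return {"any": any_rate, "all": all_rate, "gap": any_rate - all_rate}


# Hypothetical outcomes (illustrative only): four tasks, three trials each.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, False],
}
result = consistency_gap(example)
print(result)
```

In this made-up sample, three of the four tasks are solved at least once but only one is solved in every trial, so the two rates diverge sharply. A large gap of this kind is what separates an agent that can complete a task from one that does so reliably.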
Abstract
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications
