CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Johannes Kirmayr; Lukas Stappen; Elisabeth André

arXiv:2601.22027·cs.AI·January 30, 2026

TL;DR

CAR-bench is a new benchmark for evaluating the reliability, consistency, and limit-awareness of LLM agents in uncertain real-world scenarios such as in-car voice assistants. It highlights significant gaps in current capabilities.

Contribution

We introduce CAR-bench, a comprehensive benchmark for assessing LLM agents' handling of uncertainty, consistency, and policy adherence in multi-turn, tool-using dialogue within an in-car domain.

Findings

01. Large gaps in success rates between occasional and consistent performance.

02. Frontier LLMs achieve less than 50% success on disambiguation tasks.

03. Frequent policy violations and hallucinations in challenging scenarios.
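
The first finding contrasts occasional with consistent success. One way to make that contrast concrete is sketched below, assuming each task is run several times and each run is recorded as pass or fail. The function name `success_rates`, the task IDs, and the outcome data are all hypothetical illustrations; this page does not state how CAR-bench itself computes its metrics.

```python
from typing import Dict, List


def success_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compare 'occasional' and 'consistent' success over repeated trials.

    results maps a task ID to the pass/fail outcome of each trial.
    Assumed definitions (not taken from the paper):
      occasional: the task passed in at least one trial
      consistent: the task passed in every trial
    """
    if not results or any(len(trials) == 0 for trials in results.values()):
        raise ValueError("need at least one task and one trial per task")

    n_tasks = len(results)
    occasional = sum(any(trials) for trials in results.values()) / n_tasks
    consistent = sum(all(trials) for trials in results.values()) / n_tasks
    return {
        "occasional": occasional,
        "consistent": consistent,
        "gap": occasional - consistent,
    }


# Made-up outcomes for four tasks, three trials each (illustration only).
example = {
    "task_a": [True, True, True],     # passes every time
    "task_b": [True, False, True],    # passes sometimes
    "task_c": [False, False, False],  # never passes
    "task_d": [False, True, False],   # passes once
}

rates = success_rates(example)
print(rates)
```

In this toy data, three of four tasks pass at least once but only one passes every time, so the two rates differ widely even though the agent and the tasks are unchanged. A gap of that kind is what the finding describes.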

Abstract

Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require…

Code & Models

Datasets

johanneskirmayr/car-bench-dataset
dataset · 14k downloads

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Multimodal Machine Learning Applications