Benchmarking Contextual Understanding for In-Car Conversational Systems

Philipp Habicht; Lev Sorokin; Abdullah Saydemir; Ken E. Friedl; Andrea Stocco

arXiv:2512.12042·cs.CL·December 16, 2025

TL;DR

This paper evaluates the use of Large Language Models with advanced prompting techniques to assess the contextual understanding of in-car conversational systems, demonstrating scalable and cost-effective benchmarking methods.

Contribution

It introduces LLM-based evaluation methods with various prompting techniques for assessing ConvQA systems, highlighting their effectiveness and cost-efficiency compared to traditional human evaluation.

Findings

01 Advanced prompting improves small non-reasoning models' performance

02 Reasoning models outperform non-reasoning models in accuracy

03 DeepSeek-R1 achieves an F1-score of 0.99 at low cost

Abstract

In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying…
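As a rough illustration of the self-consistency idea named in the abstract, the sketch below takes several sampled judge verdicts per utterance-response pair, keeps the majority verdict, and scores the result with F1 against gold labels. It is a minimal sketch, not the authors' pipeline: the "pass"/"fail" labels, the choice of "fail" as the positive class, the function names, and the hard-coded sample verdicts are all assumptions made for illustration, and no LLM is called.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most frequent verdict among the sampled judge outputs.

    Ties resolve to the verdict seen first, so an odd number of samples
    is the safer choice for a two-label setup.
    """
    if not verdicts:
        raise ValueError("need at least one sampled verdict")
    return Counter(verdicts).most_common(1)[0][0]


def f1_score(gold, predicted, positive="fail"):
    """Compute F1 for one class; treating "fail" as positive is an assumption."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: three sampled verdicts for each of four responses.
sampled_verdicts = [
    ["fail", "fail", "pass"],
    ["pass", "pass", "pass"],
    ["pass", "fail", "fail"],
    ["pass", "pass", "fail"],
]
# Hypothetical gold labels: whether each response really contains a failure.
gold = ["fail", "pass", "fail", "fail"]

predicted = [majority_vote(v) for v in sampled_verdicts]
print(predicted)
print(round(f1_score(gold, predicted), 2))
```

In this toy data the vote misses one injected failure, so recall drops while precision stays perfect. Drawing more samples per response raises cost linearly, which is the trade-off any cost-aware comparison of prompting techniques has to account for.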

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · AI in Service Interactions