Benchmarking Contextual Understanding for In-Car Conversational Systems
Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, Andrea Stocco

TL;DR
This paper evaluates the use of Large Language Models with advanced prompting techniques to assess the contextual understanding of in-car conversational systems, demonstrating scalable and cost-effective benchmarking methods.
Contribution
It introduces LLM-based evaluation methods with various prompting techniques for assessing ConvQA systems, highlighting their effectiveness and cost-efficiency compared to traditional human evaluation.
Findings
Advanced prompting improves small non-reasoning models' performance
Reasoning models outperform non-reasoning models in accuracy
DeepSeek-R1 achieves an F1-score of 0.99 at low cost
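The F1-score cited above is the harmonic mean of precision and recall. The following is a minimal sketch of that computation for a binary judge, not the paper's evaluation code. It assumes that each system response has a ground-truth label and a label predicted by the LLM judge, and that True marks the positive class. The function name f1_score and the choice of positive class are illustrative assumptions.

```python
from typing import Sequence


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1: harmonic mean of precision and recall.

    Assumption: True is the positive class (which class counts as
    positive is not stated in the summary above).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so the F1-score is also 0.5.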
Abstract
In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by correct and modified failure-containing system responses. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying…
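Of the prompting techniques named in the abstract, self-consistency is the easiest to illustrate: the judge is sampled several times on the same utterance-response pair and the majority verdict is kept. The sketch below assumes a judge callable that returns a verdict string. The names self_consistent_verdict and make_scripted_judge, the verdict labels, and the sample count are hypothetical and not taken from the paper. The scripted judge stands in for a real LLM call so that the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical judge signature: (user utterance, system response) -> verdict label.
Judge = Callable[[str, str], str]


def self_consistent_verdict(
    judge: Judge, utterance: str, response: str, n_samples: int = 5
) -> str:
    """Query the judge n_samples times and return the most frequent verdict.

    Use an odd n_samples with two labels to avoid ties.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = [judge(utterance, response) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]


def make_scripted_judge(verdicts: Iterable[str]) -> Judge:
    """Stand-in for a sampled LLM call: replays a fixed list of verdicts."""
    it = iter(verdicts)

    def judge(utterance: str, response: str) -> str:
        return next(it)

    return judge


# Usage with a scripted judge (three of five samples say "correct").
demo_judge = make_scripted_judge(
    ["correct", "incorrect", "correct", "correct", "incorrect"]
)
verdict = self_consistent_verdict(
    demo_judge,
    "Find a vegan restaurant that is open now.",
    "I found a steakhouse nearby that opens tomorrow.",
)
```

In a real setup the judge would wrap a model call with a nonzero sampling temperature, since identical deterministic samples would make the vote redundant.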
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · AI in Service Interactions
