Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems
Rumi Allbert, Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi

TL;DR
This study empirically compares five production speech-to-text, LLM, and text-to-speech stacks for AI interview systems, using data sampled from over 300,000 AI-conducted job interviews, and examines how objective quality metrics relate to user satisfaction.
Contribution
It introduces a large-scale evaluation framework and provides practical guidance for component selection in multimodal conversational AI systems.
Findings
Google STT + GPT-4.1 + Cartesia TTS outperforms other stacks
Objective quality metrics weakly correlate with user satisfaction
The evaluation methodology is validated for human-AI interaction assessment
Abstract
Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We used an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings…
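The weak relationship between objective quality metrics and user satisfaction reported above is the kind of result a rank correlation can quantify. The following is a minimal sketch, not the authors' analysis code: the abstract does not say which correlation statistic was used, so Spearman's rank correlation is an assumption here, and the function names and the sample scores are hypothetical placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-interview values, invented purely for illustration.
    judge_scores = [7.5, 8.0, 6.0, 9.0, 7.0, 8.5]  # LLM-as-a-Judge quality score
    satisfaction = [4, 3, 4, 5, 2, 3]              # user rating on a 1-5 scale
    print(f"Spearman rho = {spearman(judge_scores, satisfaction):.3f}")
```

A coefficient near zero would indicate that interviews ranked highly by the automated judge are not consistently the ones users rate highly. A rank-based statistic is used in this sketch because satisfaction ratings are ordinal.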
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
