Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Rumi Allbert; Nima Yazdani; Ali Ansari; Aruj Mahajan; Amirhossein Afsharrad; Seyed Shahabeddin Mousavi

arXiv:2507.16835 · eess.AS · August 25, 2025

Open Access

TL;DR

This study empirically compares speech-to-text, LLM, and text-to-speech combinations used in AI interview systems, evaluating five production stacks on LLM-judged quality metrics and on user satisfaction.

Contribution

The paper introduces a large-scale, LLM-as-a-Judge evaluation framework and provides practical guidance for component selection in multimodal conversational AI systems.

Findings

01

Google STT + GPT-4.1 + Cartesia TTS outperforms the other evaluated stacks on both objective quality metrics and user satisfaction

02

Objective quality metrics weakly correlate with user satisfaction

03

The evaluation methodology is validated for human-AI interaction assessment

Abstract

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We used an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings…
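
The weak-correlation finding can be made concrete with a small sketch. The snippet below is a minimal illustration, not the authors' analysis code: it computes a Pearson correlation coefficient between objective quality scores and user satisfaction scores. The function name `pearson` and both score lists are hypothetical placeholders, not values from the paper, and the abstract does not say which correlation statistic the authors used.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up placeholder values for illustration only (NOT data from the paper):
    # one LLM-judge quality score and one satisfaction rating per interview.
    judge_scores = [7.9, 8.4, 6.8, 8.1, 7.2]
    satisfaction = [4.0, 3.5, 4.5, 4.0, 3.0]
    r = pearson(judge_scores, satisfaction)
    print(f"Pearson r = {r:.3f}")
```

A coefficient near 0 would indicate that the objective scores carry little linear information about satisfaction, which is the kind of relationship the abstract describes as weak. A rank-based statistic such as Spearman's rho would be a reasonable alternative if the satisfaction ratings are ordinal.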

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling