The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?
Jayadev Billa

TL;DR
This paper introduces a methodology for comparing speech LLMs with ASR→LLM cascades. It identifies conditions under which speech LLMs behave like cascades, shows that they perform worse than cascades under noise, and offers mechanistic insight into that behavior.
Contribution
The paper presents matched-backbone testing, an evaluation approach that separates the behavior of a speech LLM from the reasoning capabilities of its underlying LLM, together with a mechanistic analysis using logit lens and LEACE.
Findings
In most deployed use cases, current speech LLMs behave like cascades but are more expensive.
Under noise, speech LLMs perform worse than cascades.
Speech LLMs hold an advantage in clean conditions, but it reverses by up to 7.6% at 0 dB.
Abstract
Speech LLMs are widely understood to be better than ASR→LLM cascades because they have direct access to the audio and not just the transcript. In this paper, we present an evaluation methodology and a mechanistic interpretation of the observed behavior of speech LLMs. First, we introduce matched-backbone testing, which separates the behavior of the speech LLM from the reasoning capabilities of the underlying LLM. Second, we provide a mechanistic analysis of speech LLMs using logit lens and LEACE, showing that the literal transcript emerges from the LLM's hidden states and that text representations are causally necessary. We also show that in most deployed use cases, current speech LLMs are expensive cascades, and under noise they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
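To make the logit lens idea mentioned in the abstract concrete, here is a minimal sketch on toy data. It is not the paper's implementation: the function name `logit_lens`, the four-word vocabulary, the identity unembedding matrix, and the interpolated hidden states are all assumptions made for illustration. A real logit lens would typically also apply the model's final layer norm before unembedding, which this sketch omits.

```python
import numpy as np


def logit_lens(hidden_states: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Decode every layer's hidden states with the output unembedding.

    hidden_states: (num_layers, seq_len, d_model)
    unembed:       (vocab_size, d_model)
    Returns the top-1 token id per layer and position: (num_layers, seq_len).
    Simplification (assumption): no final layer norm is applied.
    """
    logits = hidden_states @ unembed.T  # (num_layers, seq_len, vocab_size)
    return logits.argmax(axis=-1)


# Hypothetical toy setup: 4-token vocabulary with one-hot unembedding rows.
vocab = ["<pad>", "turn", "lights", "off"]
unembed = np.eye(len(vocab))

# Pretend the spoken input was "turn lights off".
target_ids = np.array([1, 2, 3])
final = unembed[target_ids]           # last-layer states point at the transcript
start = np.zeros((3, len(vocab)))
start[:, 0] = 1.0                     # early states point at "<pad>"

# Fabricated hidden states that drift from `start` to `final` over 3 layers.
hidden_states = np.stack(
    [(1 - a) * start + a * final for a in (0.0, 0.6, 1.0)]
)

decoded = logit_lens(hidden_states, unembed)
for layer, ids in enumerate(decoded):
    print(layer, [vocab[i] for i in ids])
```

In this toy case the transcript tokens become the top-1 prediction from the middle layer onward, which is the kind of pattern a logit lens analysis looks for in real hidden states.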
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
