The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Jayadev Billa

arXiv:2602.17598·cs.CL·March 9, 2026

Open Access

TL;DR

This paper introduces a methodology for comparing speech LLMs with ASR→LLM pipelines. It identifies the conditions under which speech LLMs behave like cascades and, under noise, perform worse than them, and it offers a mechanistic account of that behavior.

Contribution

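The paper presents matched-backbone testing, an evaluation approach that separates the behavior of a speech LLM from the reasoning capabilities of its underlying LLM, together with a mechanistic analysis using logit lens and LEACE that shows the literal transcript emerging in the LLM's hidden states and text representations being causally necessary.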

Findings

01

In most deployed use cases, current speech LLMs behave like expensive cascades.

02

Under noise, speech LLMs perform worse than cascades.

03

The advantages speech LLMs show in clean conditions reverse under noise, by up to 7.6% at 0dB.

Abstract

Speech LLMs are widely understood to be better than ASR $\to$ LLM cascades since they have access to the audio directly, and not just the transcript. In this paper, we present an evaluation methodology and a mechanistic interpretation of the observed behavior of speech LLMs. First, we introduce matched-backbone testing which separates out the behavior of the speech LLM from the reasoning capabilities of the underlying LLM. Second, we provide a mechanistic analysis of speech LLMs using logit lens and LEACE and show the literal transcript emerging from the LLM's hidden states and that text representations are causally necessary. We also show that in most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0dB.
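The logit lens mentioned in the abstract can be illustrated with a small sketch. The general idea is to project each layer's hidden states through the model's unembedding matrix and read off the most likely token at every position, so that one can check whether transcript tokens surface in intermediate layers. The code below is a toy illustration under stated assumptions, not the paper's implementation: the names `layer_norm`, `logit_lens`, `hidden`, and `W_U` are hypothetical, the data is random, and a parameter-free layer normalization stands in for whatever final norm a real model applies.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    # Parameter-free layer norm over the feature axis.
    # Assumption: a real model would apply its own learned final norm here.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def logit_lens(hidden_states, unembed):
    """Decode every layer's hidden states through the unembedding matrix.

    hidden_states: array of shape (layers, positions, d_model)
    unembed:       array of shape (d_model, vocab)
    Returns an integer array of shape (layers, positions) holding the
    top-1 token id predicted from each layer at each position.
    """
    logits = layer_norm(hidden_states) @ unembed
    return logits.argmax(axis=-1)


# Toy stand-ins for a real model's activations and unembedding weights.
rng = np.random.default_rng(0)
n_layers, n_pos, d_model, vocab = 4, 3, 8, 5
hidden = rng.normal(size=(n_layers, n_pos, d_model))
W_U = rng.normal(size=(d_model, vocab))

top_tokens = logit_lens(hidden, W_U)
print(top_tokens)  # one row of token ids per layer
```

With a real speech LLM, `hidden` would hold the activations at the audio positions and the decoded ids would be mapped back to text with the model's tokenizer, which makes it possible to compare the per-layer readout against a reference transcript.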

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems