Speech LLMs are Contextual Reasoning Transcribers

Keqi Deng; Ruchao Fan; Bo Ren; Yiming Wang; Jinyu Li

arXiv:2604.00610 · cs.CL · April 2, 2026

TL;DR

This paper introduces CoT-ASR, a reasoning-based approach for speech recognition that leverages large language models' contextual understanding, improving accuracy and enabling user-guided transcription.

Contribution

The paper proposes a chain-of-thought reasoning framework for LLM-based ASR and introduces a modality adapter to align speech and text representations.

Findings

01

CoT-ASR reduces word error rate (WER) by 8.7% compared to standard LLM-based ASR.

02

It achieves a 16.9% reduction in entity error rate.

03

It supports user-guided transcription: user-provided context can take the place of the self-generated reasoning to guide the transcript.
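
The first finding is stated as a WER reduction. As a reminder of what that metric measures, here is a minimal Python sketch of WER (word-level edit distance divided by the number of reference words) together with a helper for relative reduction. This summary does not say whether the reported 8.7% is a relative or an absolute reduction, so `relative_reduction` is shown only as one possible reading. The function names are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline
```

For a toy pair such as `wer("the cat sat on the mat", "the cat sat on mat")`, one of six reference words is missing from the hypothesis, so the score is 1/6. Entity error rate is usually computed in a similar spirit but restricted to named entities; the exact definition used in the paper is not given in this summary.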

Abstract

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the…
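
The abstract describes a single decoding pass that emits contextual analysis first and the transcript second, with an optional user-provided context replacing the self-generated analysis. The sketch below illustrates one way such an interface could look, purely as an assumption: the delimiter strings `<context>` and `<transcript>`, the `CoTResult` type, and both function names are hypothetical and are not taken from the paper. No model is called; the code only builds the text the decoder would start from and parses a finished output.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical delimiters; the paper's actual output format is not given here.
CONTEXT_TAG = "<context>"
TRANSCRIPT_TAG = "<transcript>"


@dataclass
class CoTResult:
    context: str      # contextual analysis (self-generated or user-provided)
    transcript: str   # final transcription


def build_decoder_prefix(user_context: Optional[str] = None) -> str:
    """Return the text the decoder is forced to start with.

    Self-reasoning mode: only the opening tag, so the model writes the
    analysis itself and then the transcript.
    User-guided mode: the user's context is pre-filled, so the model
    continues directly with the transcript.
    """
    if user_context is None:
        return CONTEXT_TAG
    return f"{CONTEXT_TAG}{user_context.strip()}{TRANSCRIPT_TAG}"


def parse_output(full_text: str) -> CoTResult:
    """Split one decoded sequence (prefix plus continuation) into its two parts."""
    if not full_text.startswith(CONTEXT_TAG) or TRANSCRIPT_TAG not in full_text:
        raise ValueError("output does not follow the assumed two-part format")
    body = full_text[len(CONTEXT_TAG):]
    context, transcript = body.split(TRANSCRIPT_TAG, 1)
    return CoTResult(context.strip(), transcript.strip())
```

Under these assumptions both modes share one output format, so the same parser handles a self-generated analysis and a user-supplied one; only the forced prefix differs.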
