Speech LLMs are Contextual Reasoning Transcribers
Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

TL;DR
This paper introduces CoT-ASR, a reasoning-based approach for speech recognition that leverages large language models' contextual understanding, improving accuracy and enabling user-guided transcription.
Contribution
It proposes a novel chain-of-thought reasoning framework for LLM-based ASR and introduces a modality adapter to align speech and text representations.
Findings
CoT-ASR reduces word error rate (WER) by 8.7% compared to standard LLM-based ASR.
It achieves a 16.9% reduction in entity error rate (EER).
It supports user-guided transcription by incorporating user-provided context to guide the output.
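For readers unfamiliar with the metric, the sketch below computes WER in its standard form: word-level edit distance divided by the number of reference words. The `relative_reduction` helper shows how a percentage reduction would be read if it is a relative one, which is an assumption here because this summary does not say whether the figures are relative or absolute. The function names and the example numbers are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Made-up numbers, not results from the paper: a baseline WER of 10.0
# falling to 9.13 would be a relative reduction of about 8.7%.
example = relative_reduction(10.0, 9.13)
```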
Abstract
Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the…
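The single-pass flow described in the abstract, with analysis first and transcription second, can be pictured with a small text-handling sketch. Everything in it is hypothetical: the `<analysis>` and `<transcript>` delimiters, the function names, and the idea of pre-filling user context as the analysis segment are illustrative assumptions, not the paper's actual output format or method.

```python
from typing import Optional, Tuple

# Hypothetical delimiters; the paper's real format is not given in this summary.
A_OPEN, A_CLOSE = "<analysis>", "</analysis>"
T_OPEN, T_CLOSE = "<transcript>", "</transcript>"


def build_decoder_prefix(user_context: Optional[str] = None) -> str:
    """Text the decoder starts from.

    Without user context, the model would generate its own analysis.
    With user context (assumed mechanism), the analysis segment is
    pre-filled so generation continues directly with the transcript.
    """
    if user_context is None:
        return A_OPEN
    return A_OPEN + user_context.strip() + A_CLOSE + T_OPEN


def split_output(full_text: str) -> Tuple[str, str]:
    """Separate the contextual analysis from the transcript in one output string."""
    a_start = full_text.index(A_OPEN) + len(A_OPEN)
    a_end = full_text.index(A_CLOSE, a_start)
    t_start = full_text.index(T_OPEN, a_end) + len(T_OPEN)
    t_end = full_text.find(T_CLOSE, t_start)
    if t_end == -1:  # tolerate a missing closing tag
        t_end = len(full_text)
    return full_text[a_start:a_end].strip(), full_text[t_start:t_end].strip()


# Illustrative output string, not produced by any real model.
sample = ("<analysis>A cooking show; the speaker lists ingredients.</analysis>"
          "<transcript>add two cups of flour</transcript>")
analysis, transcript = split_output(sample)
```

Keeping both segments in one generated string is what lets reasoning and transcription share a single decoding pass; the transcript is then recovered with plain string handling.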
