Two-Stage Acoustic Adaptation with Gated Cross-Attention Adapters for LLM-Based Multi-Talker Speech Recognition
Hao Shi, Yuan Gao, Xugang Lu, Tatsuya Kawahara

TL;DR
This paper improves LLM-based multi-talker speech recognition by explicitly injecting talker-aware acoustic evidence into the decoder through a two-stage, parameter-efficient adaptation framework, with gains most pronounced in three-talker scenarios.
Contribution
It introduces a novel two-stage acoustic adaptation method using gated cross-attention adapters and LoRA updates for robust multi-talker ASR with LLMs.
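To make the idea of a gated cross-attention adapter concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the single attention head, the absence of learned query/key/value projections, the scalar `tanh` gate, and the function names are all assumptions chosen for illustration. What it shows is the general pattern: decoder hidden states attend to acoustic frames, and a gate scales how much of that attended evidence is added back through a residual connection.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: decoder hidden states, one vector per token.
    keys/values: acoustic frames, one vector per frame.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


def gated_cross_attention_adapter(hidden, acoustic, gate):
    """Residual adapter: hidden + tanh(gate) * CrossAttn(hidden, acoustic).

    The scalar tanh gate is an assumption for this sketch. With gate = 0.0
    the adapter is an identity map, so it leaves the base model untouched.
    """
    attended = cross_attention(hidden, acoustic, acoustic)
    g = math.tanh(gate)
    return [
        [h + g * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(hidden, attended)
    ]


# Toy inputs: two decoder tokens, three acoustic frames, dimension 2.
hidden = [[1.0, 0.0], [0.0, 1.0]]
acoustic = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unchanged = gated_cross_attention_adapter(hidden, acoustic, gate=0.0)
adapted = gated_cross_attention_adapter(hidden, acoustic, gate=1.0)
```

A zero-initialized gate is a common design choice for adapters of this kind, because training then starts from the unmodified LLM and admits acoustic evidence only as the gate moves away from zero. Whether the paper initializes its gates this way is not stated in the summary above.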
Findings
Significant performance gains on Libri2Mix/Libri3Mix datasets.
Large improvements in three-talker mixture recognition.
Robust recognition under noisy conditions.
Abstract
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content. The CTC information is obtained using the serialized CTC…
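For readers unfamiliar with Serialized Output Training, the following sketch shows how a multi-talker reference transcript is typically serialized into a single target string. The `<sc>` speaker-change token, the ordering by start time, and the `serialize_sot` helper with its dictionary fields are assumptions based on the usual SOT formulation, not details taken from this abstract.

```python
def serialize_sot(utterances, sep=" <sc> "):
    """Join per-talker transcripts into one target string.

    utterances: list of dicts with a "start" time (seconds) and a "text" field.
    Talkers are ordered by start time (first-in, first-out), and consecutive
    transcripts are separated by a speaker-change token.
    """
    ordered = sorted(utterances, key=lambda u: u["start"])
    return sep.join(u["text"] for u in ordered)


# Hypothetical two-talker mixture; the second entry starts speaking first.
mixture = [
    {"start": 1.2, "text": "good morning"},
    {"start": 0.0, "text": "hello there"},
]
target = serialize_sot(mixture)
```

The decoder is then trained to emit this single sequence, so the number of talkers is expressed implicitly by the number of separator tokens rather than by separate output branches.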
