Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration
Haoxuan Wang

TL;DR
This paper explores integrating large language models with speech recognition systems to improve zero-shot rare word recognition, demonstrating significant performance gains and analyzing key factors like data quality and adapter design.
Contribution
It introduces a novel LLM-ASR architecture that outperforms traditional models in rare word recognition and provides detailed analysis of the factors influencing performance.
Findings
LLM integration significantly reduces rare word error rate
Adapter modules are crucial for aligning speech encoder outputs with LLMs
High-quality labeled data enhances recognition accuracy
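The adapter finding can be illustrated with a small sketch. The summary does not describe the paper's adapter design, so this assumes one common pattern: stack every k consecutive encoder frames to shorten the sequence, then apply a linear projection into the LLM embedding dimension. The names `stack_frames`, `linear`, `adapter`, and all dimensions are hypothetical and chosen only for illustration.

```python
import random

def stack_frames(frames, k):
    """Concatenate every k consecutive frames into one; a ragged tail is dropped."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]

def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def adapter(frames, weight, bias, k):
    """Assumed adapter: downsample by frame stacking, then project to the LLM width."""
    return [linear(f, weight, bias) for f in stack_frames(frames, k)]

# Hypothetical sizes: 4-dim encoder frames, 6-dim LLM embeddings, stacking factor 2.
random.seed(0)
enc_dim, llm_dim, k = 4, 6, 2
frames = [[random.gauss(0, 1) for _ in range(enc_dim)] for _ in range(7)]
weight = [[random.gauss(0, 0.1) for _ in range(enc_dim * k)] for _ in range(llm_dim)]
bias = [0.0] * llm_dim

embeddings = adapter(frames, weight, bias, k)
print(len(embeddings), len(embeddings[0]))  # 3 6
```

Seven input frames become three vectors of LLM width, which could then be placed in front of the text prompt embeddings. In a trained system the weights would be learned rather than random.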
Abstract
In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, focusing on improving rare word recognition. Using a 190,000-hour dataset primarily sourced from YouTube and pseudo-labeled with Whisper V3, we demonstrate that the LLM-ASR architecture, when trained on a large dataset, outperforms traditional Zipformer-Transducer models on the zero-shot rare word recognition task. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER), while the speech encoder primarily determines overall transcription performance, measured by Orthographic Word Error Rate (O-WER) and Normalized Word Error Rate (N-WER). Through extensive ablation studies, we highlight the importance of adapter integration in aligning speech encoder outputs with the LLM's linguistic…
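The three metrics named in the abstract can be sketched as follows. The abstract gives only their names, so the definitions here are assumptions: O-WER is taken as word error rate on the raw text (casing and punctuation kept), N-WER as word error rate after lowercasing and stripping punctuation, and the rare-word measure as the fraction of reference rare words missing from the hypothesis. The paper's actual definitions may differ, and the example sentence and rare-word list are invented.

```python
import re

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: edit distance divided by the reference length."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def normalize(text):
    """Assumed normalization: lowercase and drop punctuation except apostrophes."""
    return re.sub(r"[^\w\s']", "", text.lower())

def rare_word_miss_rate(ref, hyp, rare_words):
    """Assumed rare-word measure: share of reference rare words absent from hyp."""
    hyp_set = set(normalize(hyp).split())
    targets = [w for w in normalize(ref).split() if w in rare_words]
    if not targets:
        return 0.0
    return sum(w not in hyp_set for w in targets) / len(targets)

ref = "Dr. Okonkwo prescribed Zolpidem."
hyp = "dr okonko prescribed zolpidem"

print(wer(ref, hyp))                                          # 0.75 (raw text)
print(wer(normalize(ref), normalize(hyp)))                    # 0.25 (normalized)
print(rare_word_miss_rate(ref, hyp, {"okonkwo", "zolpidem"})) # 0.5
```

In this toy case, casing and punctuation account for most of the raw error, while the rare-word measure isolates the one misrecognized name.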
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
Methods: Adapter
