Index-ASR Technical Report
Zheshu Song, Lu Wang, Wei Deng, Zhuo Yang, Yong Wu, Bin Xia

TL;DR
Index-ASR is a large-scale LLM-based speech recognition system that improves robustness against hallucinations and enables customizable hotword recognition by training on large-scale data enriched with background noise and contextual information.
Contribution
The paper introduces Index-ASR, a novel LLM-based ASR system that enhances robustness and supports flexible contextual customization, addressing key limitations of existing models.
Findings
Achieves strong performance on open-source benchmarks.
Demonstrates robustness in real-world scenarios.
Supports customizable hotword recognition.
Abstract
Automatic speech recognition (ASR) has witnessed remarkable progress in recent years, largely driven by the emergence of the LLM-based ASR paradigm. Despite their strong performance on a variety of open-source benchmarks, existing LLM-based ASR systems still suffer from two critical limitations. First, they are prone to hallucination errors, often generating excessively long and repetitive outputs that are not well grounded in the acoustic input. Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. The core idea of Index-ASR lies in combining an LLM with large-scale training data enriched with background noise and contextual information. Experimental results show that our…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
