Index-ASR Technical Report

Zheshu Song; Lu Wang; Wei Deng; Zhuo Yang; Yong Wu; Bin Xia

arXiv:2601.00890·cs.SD·January 6, 2026

Open Access

TL;DR

Index-ASR is a large-scale LLM-based speech recognition system that improves robustness against hallucinations and enables customizable hotword recognition by training on large-scale data enriched with background noise and contextual information.

Contribution

The paper introduces Index-ASR, an LLM-based ASR system that enhances robustness and supports flexible contextual customization, addressing two limitations of existing LLM-based ASR systems: hallucinated outputs and limited support for contextual customization.

Findings

01. Achieves strong performance on open-source benchmarks.

02. Demonstrates robustness in real-world scenarios.

03. Supports customizable hotword recognition.

Abstract

Automatic speech recognition (ASR) has witnessed remarkable progress in recent years, largely driven by the emergence of the LLM-based ASR paradigm. Despite their strong performance on a variety of open-source benchmarks, existing LLM-based ASR systems still suffer from two critical limitations. First, they are prone to hallucination errors, often generating excessively long and repetitive outputs that are not well grounded in the acoustic input. Second, they provide limited support for flexible and fine-grained contextual customization. To address these challenges, we propose Index-ASR, a large-scale LLM-based ASR system designed to simultaneously enhance robustness and support customizable hotword recognition. The core idea of Index-ASR lies in the integration of an LLM with large-scale training data enriched with background noise and contextual information. Experimental results show that our…
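To make the first limitation concrete, the sketch below flags transcripts that are excessively long for the audio duration or that repeat the same word n-grams. It is an illustrative post-hoc heuristic only, not a method described in the abstract: Index-ASR addresses hallucination through its training data. The function names, the n-gram size, and both thresholds are assumptions chosen for the example.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an n-gram seen elsewhere in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first of a given n-gram counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_hallucinated(
    text: str,
    audio_seconds: float,
    max_words_per_second: float = 6.0,  # hypothetical threshold
    max_repeat_ratio: float = 0.3,      # hypothetical threshold
) -> bool:
    """Flag a transcript that is too long for its audio or too repetitive."""
    too_long = len(text.split()) > max_words_per_second * audio_seconds
    too_repetitive = repeated_ngram_ratio(text) > max_repeat_ratio
    return too_long or too_repetitive


if __name__ == "__main__":
    print(looks_hallucinated("please turn on the kitchen lights", 2.0))
    print(looks_hallucinated("thank you thank you thank you thank you", 10.0))
```

Whitespace tokenization suits space-delimited languages such as English; character-level n-grams would be a more natural unit for languages written without spaces.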

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing