HAINAN: Fast and Accurate Transducer for Hybrid-Autoregressive ASR
Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg

TL;DR
HAINAN introduces a flexible speech recognition model that combines autoregressive and non-autoregressive inference, achieving high accuracy and efficiency across multiple languages and datasets.
Contribution
The paper proposes HAINAN, a novel hybrid-inference transducer architecture that supports both inference modes and introduces a semi-autoregressive paradigm for improved accuracy.
Findings
Achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode.
Outperforms TDT and RNN-T in autoregressive mode.
Significantly outperforms CTC in non-autoregressive mode.
Abstract
We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while…
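The training trick described in the abstract (randomly masking the predictor output before the joint computation) can be sketched as follows. This is a minimal illustration, not the authors' code: the additive combination of encoder and predictor vectors, the function names `joint_input` and `training_joint_input`, and the default `mask_prob=0.5` are assumptions made for the example. It also leaves open whether masking is applied per utterance or per position.

```python
import random


def joint_input(enc_frame, pred_state, use_predictor=True):
    """Combine one encoder frame with one predictor state.

    Assumes an additive joiner (hypothetical choice for this sketch).
    With use_predictor=False the joiner sees the encoder frame only,
    which corresponds to the non-autoregressive inference path.
    """
    if not use_predictor:
        return list(enc_frame)
    return [e + p for e, p in zip(enc_frame, pred_state)]


def training_joint_input(enc_frame, pred_state, mask_prob=0.5, rng=random):
    """During training, drop the predictor contribution with probability mask_prob."""
    use_predictor = rng.random() >= mask_prob
    return joint_input(enc_frame, pred_state, use_predictor)
```

At inference time, the same `joint_input` would be called with `use_predictor=True` for autoregressive decoding and `use_predictor=False` for non-autoregressive decoding, so a single trained model serves both modes.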
Peer Reviews
Decision: ICLR 2025 Poster
1. Masking the target is an effective approach, verified in several other models (TDS for CTC, MaskPredict for NAR translation), which can serve as a regularization method or as a way to build a non-autoregressive model. Applying it to TDT is a clever idea that serves both purposes.
2. Empirical results from a large-scale study are convincing in terms of the effectiveness of improving NAR and of the more efficient decoding methods (NAR, SAR) proposed in the paper based on TDT.
3. The paper also presents a good s
1. Technically, it is not correct to call the method non-autoregressive. Although the predictor does not take the predicted token as input, the method still depends on the duration predicted at the previous step to determine which enc[t] to compute the argmax on (line 7 of Xu et al. (2023)). Hence, decoding cannot be fully parallel and should still be considered autoregressive. I acknowledge that runtime-wise it is almost identical to NAR because the argmax can be precomputed for all (t, u), but this is still w
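The reviewer's distinction can be made concrete with a small sketch. It assumes the per-frame argmax token and duration have already been computed in parallel for every frame, and shows that the remaining step is a sequential walk whose next position depends on the previously selected duration. The function name `nar_duration_decode`, the blank index `0`, and the rule that forces at least one frame of progress are assumptions of this example, not details taken from the paper.

```python
def nar_duration_decode(tokens, durations, blank=0):
    """Walk over precomputed per-frame predictions.

    tokens[t] and durations[t] are the argmax token and duration at frame t,
    assumed to have been computed for all frames in one parallel pass.
    The loop itself is sequential: the next frame visited depends on the
    duration chosen at the current frame.
    """
    hyp = []
    t = 0
    while t < len(tokens):
        if tokens[t] != blank:
            hyp.append(tokens[t])
        # Assumption of this sketch: advance by at least one frame so that
        # a predicted duration of 0 cannot stall the walk.
        t += max(durations[t], 1)
    return hyp
```

For example, `nar_duration_decode([5, 9, 0, 7, 3, 2], [2, 1, 1, 2, 1, 0])` visits frames 0, 2, 3 and 5 and skips frames 1 and 4 entirely. The skipping is what makes the walk cheap, and the dependence of each visited frame on the previous duration is what the reviewer describes as still autoregressive.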
**Originality**: The paper introduces a simple but effective technique: masking the predictor output 50% of the time during training, which allows the joiner to handle non-autoregressive (NAR) inference. This enables the model to support multiple inference strategies: AR, NAR, and SAR. The carefully crafted NAR and SAR inference strategies leverage the TDT architecture well and show improved performance across multiple datasets.

**Quality**: The paper provides thorough experimental evaluations a
1. **Incremental Architectural Innovation**: While HAI-T's hybrid approach to inference is creative, it builds on the existing Token-and-Duration Transducer (TDT) model. The primary novelty lies in its training strategy and inference flexibility rather than any fundamental architectural innovations. This could make the contribution seem incremental.
2. **Lack of Detailed Training/Inference Specifications**: The paper omits crucial details about the training setup, such as the type and number of
* The paper introduces a straightforward approach that requires only a single-line change to the existing joint computation code in the Token-and-Duration (TDT) implementation and performs effectively in practice.
* Evaluations are conducted on multiple ASR corpora, including the AMI test, Earnings22, Gigaspeech test, Librispeech test-clean and test-other, Spgispeech test, Tedlium test, and VoxPopuli test.
Although the method is simple, the proposed HAI-T appears to be an incremental improvement over TDT.
Taxonomy
Topics: Ultrasonics and Acoustic Wave Propagation · Machine Learning and ELM · Blind Source Separation Techniques
