SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR
Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

TL;DR
This paper introduces SA-Paraformer, a non-autoregressive end-to-end speaker-attributed ASR model. It reaches accuracy comparable to state-of-the-art autoregressive models at a much lower real-time factor by combining parallel decoding with speaker-filling and inter-CTC strategies.
Contribution
The paper proposes a non-autoregressive, Paraformer-based model for SA-ASR. A speaker-filling strategy reduces speaker identification errors, and an inter-CTC strategy strengthens the encoder's acoustic modeling.
Findings
Outperforms the cascaded SA-ASR model with a 6.1% relative reduction in speaker-dependent character error rate (SD-CER).
Achieves an SD-CER of 34.8% at only 1/10 of the real-time factor (RTF) of SOTA autoregressive models.
Demonstrates accuracy comparable to autoregressive models with faster inference.
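The two metrics behind these findings can be made concrete with a minimal sketch. It assumes the usual definitions: a relative reduction is the drop in error rate divided by the baseline error rate, and RTF is decoding time divided by audio duration. The function names and all numbers below are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate, as a fraction of the baseline."""
    return (baseline - improved) / baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the decoded audio."""
    return processing_seconds / audio_seconds


# Hypothetical error rates (in percent), for illustration only.
baseline_cer, improved_cer = 40.0, 36.0
rel = relative_reduction(baseline_cer, improved_cer)

# Hypothetical decoding times for the same 60 s recording.
rtf_slow = real_time_factor(30.0, 60.0)  # sequential, token-by-token decoder
rtf_fast = real_time_factor(3.0, 60.0)   # single-pass, parallel decoder

print(f"relative reduction: {rel:.1%}")
print(f"RTF ratio (fast / slow): {rtf_fast / rtf_slow:.2f}")
```

Under these made-up inputs, a ratio of 0.10 is what "1/10 RTF" means: the faster system needs a tenth of the decoding time for the same audio.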
Abstract
Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
