SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Yangze Li; Fan Yu; Yuhao Liang; Pengcheng Guo; Mohan Shi; Zhihao Du; Shiliang Zhang; Lei Xie

arXiv:2310.04863 · cs.SD · October 10, 2023

TL;DR

This paper introduces SA-Paraformer, a non-autoregressive end-to-end speaker-attributed ASR model that achieves accuracy comparable to state-of-the-art autoregressive models with significantly faster inference, using parallel decoding together with speaker-filling and inter-CTC strategies.

Contribution

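The paper proposes a non-autoregressive, Paraformer-based model for SA-ASR whose single-step decoder generates tokens in parallel to speed up inference. It adds a speaker-filling strategy to reduce speaker identification errors and an inter-CTC strategy to strengthen the encoder's acoustic modeling.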

Findings

01

Outperforms the cascaded SA-ASR model on the AliMeeting corpus, with a 6.1% relative reduction in SD-CER.

02

Achieves an SD-CER of 34.8% with only 1/10 the RTF of SOTA models.

03

Demonstrates comparable accuracy to autoregressive models with faster inference.
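
The two metrics behind these findings can be made concrete with a small sketch. It assumes the usual definitions: the real-time factor (RTF) is processing time divided by audio duration, and a relative reduction is the drop in error rate divided by the baseline error rate. The function names and the numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio that was decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction of new_error with respect to baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers, chosen only to illustrate the arithmetic.
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=60.0)
gain = relative_reduction(baseline_error=40.0, new_error=38.0)
print(f"RTF = {rtf:.2f}, relative reduction = {gain:.1%}")
```

An RTF below 1 means the system decodes faster than real time, so a model with one tenth of another model's RTF needs one tenth of the processing time for the same audio.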

Abstract

Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although they achieve state-of-the-art (SOTA) performance, most of these studies rely on an autoregressive (AR) decoder, which generates tokens one by one and results in a large real-time factor (RTF). To speed up inference, we introduce the recently proposed non-autoregressive model Paraformer as the acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining performance comparable to the SOTA AR transformer models. In addition, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's acoustic modeling ability. Experiments on the AliMeeting corpus show that our model outperforms the cascaded…
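
The contrast between one-by-one and single-step decoding described in the abstract can be illustrated with a toy sketch. This is not the paper's model: `CountingDecoder`, its methods, and the tiny vocabulary are hypothetical stand-ins, and the sketch assumes the number of output tokens is already known. It only shows why an AR decoder needs one sequential pass per token while a single-step decoder needs one pass in total.

```python
from typing import List

VOCAB = ["a", "b", "c", "d"]  # hypothetical toy vocabulary


class CountingDecoder:
    """Toy decoder that counts how many forward passes it performs."""

    def __init__(self) -> None:
        self.passes = 0

    def step(self, frame: int, history: List[str]) -> str:
        # One pass yields one token. A real AR decoder would condition on
        # `history`; this toy accepts it but ignores it for simplicity.
        self.passes += 1
        return VOCAB[frame % len(VOCAB)]

    def parallel(self, frames: List[int]) -> List[str]:
        # One pass yields every token at once.
        self.passes += 1
        return [VOCAB[f % len(VOCAB)] for f in frames]


def decode_autoregressive(decoder: CountingDecoder, frames: List[int]) -> List[str]:
    tokens: List[str] = []
    for frame in frames:  # sequential: step t waits for steps 0..t-1
        tokens.append(decoder.step(frame, tokens))
    return tokens


def decode_single_step(decoder: CountingDecoder, frames: List[int]) -> List[str]:
    return decoder.parallel(frames)


frames = [0, 1, 2, 3, 1]  # hypothetical per-token acoustic inputs
ar_decoder, nar_decoder = CountingDecoder(), CountingDecoder()
ar_tokens = decode_autoregressive(ar_decoder, frames)
nar_tokens = decode_single_step(nar_decoder, frames)
print(ar_decoder.passes, nar_decoder.passes)
```

Because the toy ignores token history, both paths return the same tokens. In a real system the two decoders differ in accuracy as well as cost, which is the trade-off the abstract addresses.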

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing