A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings
Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie

TL;DR
This paper compares three approaches for speaker-attributed automatic speech recognition in multi-party meetings, highlighting the benefits of word-level diarization and joint separation and recognition methods for improved accuracy.
Contribution
The study introduces and evaluates two novel approaches, WD-SOT and TS-ASR, that address alignment errors and improve speaker-attributed ASR performance.
Findings
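WD-SOT achieves a 10.7% relative reduction in average speaker-dependent character error rate (SD-CER) compared with FD-SOT.
TS-ASR achieves a 16.5% relative reduction in average SD-CER compared with FD-SOT.
Experimental results on the AliMeeting corpus demonstrate the effectiveness of the proposed methods.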
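As a reading aid for the relative reductions reported above, the sketch below shows how a relative error-rate reduction is conventionally computed: (baseline - improved) / baseline. The function name and the numeric values are hypothetical illustrations and are not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of `improved` with respect to `baseline`.

    Both arguments are error rates in the same unit (for example, percent).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical error rates for illustration only; not figures from the paper.
example = relative_reduction(baseline=40.0, improved=36.0)
print(f"relative reduction: {example:.1%}")
```

A relative reduction is not the same as an absolute one: in this hypothetical case the absolute drop is 4 points, while the relative drop is one tenth of the baseline.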
Abstract
In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results and recognized hypotheses. However, such an alignment strategy may suffer from erroneous timestamps due to the modular independence, severely hindering the model performance. Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency. To further mitigate the alignment issues, we…
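To make the alignment problem concrete, here is a minimal sketch of one simple way recognized utterances could be attributed to speakers by timestamp overlap with diarization segments. It is an assumption for illustration, not the paper's implementation: the `Segment` class, the `attribute_by_timestamps` function, and all timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A diarization segment: one speaker active over [start, end] seconds."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_by_timestamps(utterances, segments):
    """Assign each (text, start, end) utterance to the speaker whose
    diarization segment overlaps it most; 'unknown' if nothing overlaps."""
    attributed = []
    for text, start, end in utterances:
        best_speaker, best_overlap = "unknown", 0.0
        for seg in segments:
            ov = overlap(start, end, seg.start, seg.end)
            if ov > best_overlap:
                best_speaker, best_overlap = seg.speaker, ov
        attributed.append((best_speaker, text))
    return attributed


# Hypothetical diarization output and recognized utterances.
segments = [Segment("A", 0.0, 2.0), Segment("B", 1.5, 4.0)]
utterances = [("hello", 0.0, 1.8), ("hi there", 1.9, 3.5)]
result = attribute_by_timestamps(utterances, segments)

# The same utterances with an erroneous timestamp on the second one.
shifted = [("hello", 0.0, 1.8), ("hi there", 0.5, 1.9)]
result_shifted = attribute_by_timestamps(shifted, segments)
```

With accurate timestamps the second utterance goes to speaker B; with the shifted timestamp it is attributed to speaker A instead. This is the kind of error that independently estimated timestamps can introduce in a modular pipeline.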
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
