A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Fan Yu; Zhihao Du; Shiliang Zhang; Yuxiao Lin; Lei Xie

arXiv:2203.16834 · cs.SD · July 4, 2022

TL;DR

This paper compares three approaches for speaker-attributed automatic speech recognition in multi-party meetings, highlighting the benefits of word-level diarization and joint separation and recognition methods for improved accuracy.

Contribution

The study introduces and evaluates two novel approaches, WD-SOT and TS-ASR, that address alignment errors and improve speaker-attributed ASR performance.

Findings

1. WD-SOT reduces the speaker-dependent character error rate (SD-CER) by 10.7% relative to FD-SOT.

2. TS-ASR achieves a 16.5% relative reduction in SD-CER compared with FD-SOT.

3. Both results are measured on the AliMeeting corpus.
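
The reductions above are relative, not absolute: the drop in error rate is divided by the baseline error rate. A minimal sketch of that arithmetic follows. The function name and the numbers in the example are illustrative assumptions and are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative error-rate reduction as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Made-up error rates for illustration only (NOT taken from the paper).
example = relative_reduction(baseline=40.0, improved=36.0)
print(f"{example:.1%}")  # prints 10.0%
```

With this definition, a 10.7% relative reduction means the new error rate is 89.3% of the baseline error rate, whatever the baseline value is.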

Abstract

In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic with increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results and recognized hypotheses. However, such an alignment strategy may suffer from erroneous timestamps due to the modular independence, severely hindering the model performance. Therefore, we propose the second approach, WD-SOT, to address alignment errors by introducing a word-level diarization model, which can get rid of such timestamp alignment dependency. To further mitigate the alignment issues, we…
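
To make the alignment problem concrete, here is a generic sketch of timestamp-based speaker attribution: each recognized token is assigned to the diarization segment it overlaps most. This is not the paper's FD-SOT procedure. The tuple formats, function names, and sample values are all hypothetical, chosen only to show how a small timestamp error can change or lose the speaker label.

```python
from typing import List, Optional, Tuple

# Hypothetical formats, assumed for this illustration only.
Segment = Tuple[str, float, float]  # (speaker, start_sec, end_sec)
Token = Tuple[str, float, float]    # (text, start_sec, end_sec)


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_tokens(
    tokens: List[Token], segments: List[Segment]
) -> List[Tuple[str, Optional[str]]]:
    """Assign each token to the speaker whose segment overlaps it most.

    A token that overlaps no segment gets None as its speaker.
    """
    result = []
    for text, t_start, t_end in tokens:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for speaker, s_start, s_end in segments:
            ov = overlap(t_start, t_end, s_start, s_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        result.append((text, best_speaker))
    return result


# Made-up diarization output and token timestamps.
segments = [("spk1", 0.0, 2.0), ("spk2", 1.5, 4.0)]
tokens = [("hello", 0.2, 0.8), ("hi", 2.5, 3.0), ("late", 4.7, 5.0)]
print(attribute_tokens(tokens, segments))
# prints [('hello', 'spk1'), ('hi', 'spk2'), ('late', None)]
```

The last token falls outside every segment and gets no speaker, which is the kind of failure that arises when two independent modules disagree on timing. A word-level diarization model, as described in the abstract, avoids relying on such timestamp matching.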

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques