A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Mohan Shi; Jie Zhang; Zhihao Du; Fan Yu; Qian Chen; Shiliang Zhang; Li-Rong Dai

arXiv:2211.00511·eess.AS·March 3, 2023

Open Access

TL;DR

This paper introduces three multichannel approaches to speaker-attributed automatic speech recognition (SA-ASR) in multi-party meetings: MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. Each uses its own multichannel data fusion strategy, and the multichannel models are reported to outperform their single-channel counterparts on the AliMeeting corpus.

Contribution

It proposes three multichannel SA-ASR models, each with a data fusion technique matched to the task: channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT, and neural beamforming for MC-TS-ASR.

Findings

01

Multichannel models outperform single-channel counterparts.

02

Channel-level and frame-level attention improve recognition accuracy.

03

Neural beamforming enhances multichannel speech processing.

Abstract

Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It was shown that single-channel frame-level diarization with serialized output training (SC-FD-SOT), single-channel word-level diarization with SOT (SC-WD-SOT) and joint training of single-channel target-speaker separation and ASR (SC-TS-ASR) can be exploited to partially solve this problem. In this paper, we propose three corresponding multichannel (MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For different tasks/models, different multichannel data fusion strategies are considered, including channel-level cross-channel attention for MC-FD-SOT, frame-level cross-channel attention for MC-WD-SOT and neural beamforming for MC-TS-ASR. Results on the AliMeeting corpus reveal that our proposed models can consistently outperform…
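
As a rough illustration of the general idea behind cross-channel attention, the sketch below lets each microphone channel attend to every other channel at a single time frame and then averages the attended channels into one fused feature vector. It is a generic, simplified example and not the architecture from the paper: the function names (`cross_channel_attention`, `fuse_utterance`), the use of unprojected features as queries, keys and values, the mean fusion, and the toy dimensions are all assumptions made for illustration. How the paper's channel-level and frame-level variants differ is not described in the abstract, so the sketch does not attempt to reproduce either.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_channel_attention(frame):
    """Attend across channels for one time frame.

    frame: list of C feature vectors, one per microphone channel.
    Returns (fused, weights), where weights[i][j] is how strongly
    channel i attends to channel j and fused is the channel average
    of the attended vectors.

    Assumption: queries, keys and values are the raw features; a real
    model would apply learned linear projections first.
    """
    dim = len(frame[0])
    scale = math.sqrt(dim)
    # Scaled dot-product scores between every pair of channels.
    weights = [softmax([dot(q, k) / scale for k in frame]) for q in frame]
    # Each channel becomes a weighted sum of all channels' features.
    attended = [
        [sum(w * v[d] for w, v in zip(row, frame)) for d in range(dim)]
        for row in weights
    ]
    # Assumption: fuse channels by simple averaging.
    fused = [sum(ch[d] for ch in attended) / len(frame) for d in range(dim)]
    return fused, weights


def fuse_utterance(feats):
    """feats[c][t] is the feature vector of channel c at frame t.

    Returns one fused feature vector per frame.
    """
    num_frames = len(feats[0])
    return [
        cross_channel_attention([channel[t] for channel in feats])[0]
        for t in range(num_frames)
    ]


# Toy input: 8 channels, 5 frames, 4-dimensional features (arbitrary sizes).
random.seed(0)
C, T, D = 8, 5, 4
feats = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(T)] for _ in range(C)]
fused = fuse_utterance(feats)
print(len(fused), len(fused[0]))  # 5 4
```

The fused sequence has one vector per frame, so under these assumptions it could stand in for a single-channel feature sequence as input to a downstream recognizer.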

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing