The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Zhe Wang; Shilong Wu; Hang Chen; Mao-Kui He; Jun Du; Chin-Hui Lee; Jingdong Chen; Shinji Watanabe; Sabato Siniscalchi; Odette Scharenborg; Diyuan Liu; Baocai Yin; Jia Pan; Jianqing Gao; Cong Liu

arXiv:2303.06326·cs.MM·March 14, 2023·1 cites

TL;DR

The paper introduces the MISP2022 challenge focusing on audio-visual speaker diarization and recognition in Chinese, using real-world home-TV scenarios with noisy, far-field audio and video data.

Contribution

It presents the dataset, challenge tracks, and baseline systems for audio-visual diarization and recognition in complex, noisy environments.

Findings

1. The baseline system shows good performance on the AVDR task.

2. Challenges include poor far-field video quality and background TV noise.

3. Indistinguishable speakers pose additional difficulties.

Abstract

The Multimodal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that addresses "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the…
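
To make the difference between the two tracks concrete, the following is a minimal sketch of what their outputs might look like as data. It is an illustration only, not the challenge's file format or the baseline's code: the class names, the `S1`/`S2` speaker labels, the timestamps, and the placeholder transcripts are all assumptions introduced here. An AVSD-style result answers "who spoke when" as a list of speaker-labelled time segments; an AVDR-style result additionally carries the text for each segment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiarizationSegment:
    """One AVSD-style answer to "who spoke when" (hypothetical layout)."""
    speaker: str   # hypothetical speaker label, e.g. "S1"
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds


@dataclass
class RecognizedSegment(DiarizationSegment):
    """AVDR-style answer: the same segment plus what was said."""
    text: str


def active_speakers(segments: List[DiarizationSegment], t: float) -> List[str]:
    """Sorted labels of speakers talking at time t (start inclusive, end exclusive)."""
    return sorted({s.speaker for s in segments if s.start <= t < s.end})


def attach_text(segments: List[DiarizationSegment],
                texts: List[str]) -> List[RecognizedSegment]:
    """Pair each diarization segment with a transcript, in order."""
    if len(segments) != len(texts):
        raise ValueError("need exactly one transcript per segment")
    return [RecognizedSegment(s.speaker, s.start, s.end, txt)
            for s, txt in zip(segments, texts)]


# Made-up example: two speakers whose turns overlap between 2.0 s and 2.5 s.
diarization = [
    DiarizationSegment("S1", 0.0, 2.5),
    DiarizationSegment("S2", 2.0, 4.0),
]
recognized = attach_text(diarization,
                         ["placeholder utterance A", "placeholder utterance B"])

print(active_speakers(diarization, 2.2))
print(recognized[1])
```

Overlapping segments are allowed on purpose in this sketch, since several people talking at once is plausible in a multi-party home conversation; whether and how the challenge scores overlap is not stated in the text above.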

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis