The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu

TL;DR
The paper introduces the MISP2022 challenge focusing on audio-visual speaker diarization and recognition in Chinese, using real-world home-TV scenarios with noisy, far-field audio and video data.
Contribution
It presents the dataset, challenge tracks, and baseline systems for audio-visual diarization and recognition in complex, noisy environments.
Findings
The baseline system performs well on the AVDR task.
Challenges include far-field video quality and background TV noise.
Indistinguishable speakers pose additional difficulties.
Abstract
The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which addresses "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
