The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Shilong Wu; Chenxi Wang; Hang Chen; Yusheng Dai; Chenyue Zhang; Ruoyu Wang; Hongbo Lan; Jun Du; Chin-Hui Lee; Jingdong Chen; Shinji Watanabe; Sabato Marco Siniscalchi; Odette Scharenborg; Zhong-Qiu Wang; Jia Pan; Jianqing Gao

arXiv:2309.08348·eess.AS·September 18, 2023·1 cites

Open Access

TL;DR

The MISP 2023 challenge introduces a novel audio-visual target speaker extraction task aimed at improving speech recognition accuracy in real-world, complex acoustic environments, setting a new benchmark in multimodal speech processing.

Contribution

This paper presents the first benchmark for AVTSE in real-world scenarios, including task setup, dataset, baseline system, and analysis of challenges in multimodal speech processing.

Findings

01. The task is highly demanding for current systems.

02. Baseline results highlight the difficulty of AVTSE in real environments.

03. The challenge encourages innovative solutions for robust speech extraction.

Abstract

Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis