AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals

Dongliang Zhou; Yakun Zhang; Jinghan Wu; Xingyu Zhang; Liang Xie; Erwei Yin

arXiv:2501.16780 · cs.SD · July 8, 2025

Open Access

TL;DR

The AVE Speech dataset is a large-scale, multi-modal collection of Mandarin speech data integrating audio, video, and EMG signals, intended to improve speech recognition, especially in noisy and cross-subject scenarios.

Contribution

This paper introduces what the authors describe as the first publicly available sentence-level Mandarin speech dataset combining audio, visual, and EMG signals for large-scale recognition tasks.

Findings

01. Multi-modal data significantly improves speech recognition accuracy.

02. Combining modalities enhances performance in noisy environments.

03. The dataset supports cross-subject and speaker-independent research.

Abstract

The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these challenges, we introduce AVE Speech, a comprehensive multi-modal dataset for speech recognition tasks. The dataset includes a 100-sentence Mandarin corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG) data, collected from 100 participants. Each subject read the entire corpus ten times, with each sentence averaging approximately two seconds in duration, resulting in over 55 hours of speech data per modality. Experiments demonstrate that combining these modalities significantly improves recognition performance, particularly in cross-subject and high-noise settings. To our knowledge, this is the first publicly available sentence-level dataset integrating these three modalities…
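The "over 55 hours" figure can be sanity-checked from the counts given in the abstract. The sketch below is only a back-of-the-envelope estimate: it assumes every utterance lasts exactly the stated average of two seconds, and the function name `corpus_hours` is hypothetical, not something from the paper or its released code.

```python
# Rough check of the corpus size stated in the abstract.
# Assumption: each utterance lasts exactly the quoted average (about 2 s).
PARTICIPANTS = 100          # speakers
SENTENCES = 100             # sentences in the Mandarin corpus
REPETITIONS = 10            # times each speaker read the corpus
AVG_SENTENCE_SECONDS = 2.0  # approximate average duration per sentence


def corpus_hours(participants, sentences, repetitions, avg_seconds):
    """Return (utterance count, estimated hours of data per modality)."""
    utterances = participants * sentences * repetitions
    hours = utterances * avg_seconds / 3600.0
    return utterances, hours


utterances, hours = corpus_hours(
    PARTICIPANTS, SENTENCES, REPETITIONS, AVG_SENTENCE_SECONDS
)
print(f"{utterances} utterances, about {hours:.1f} hours per modality")
# prints "100000 utterances, about 55.6 hours per modality"
```

Under these assumptions the estimate is roughly 55.6 hours, which is consistent with the abstract's "over 55 hours" per modality.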

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis