TL;DR
LRS-VoxMM is an in-the-wild benchmark for audio-visual speech recognition (AVSR). Derived from VoxMM, it covers diverse real-world scenarios and acoustic conditions and includes distorted evaluation sets for testing robustness.
Contribution
The paper introduces a challenging AVSR benchmark derived from VoxMM and preprocessed in an LRS-style format, together with distorted evaluation sets (additive noise, reverberation, bandwidth limitation) for testing robustness under severe acoustic degradation.
Findings
LRS-VoxMM is considerably harder than LRS3 in the reported experiments.
Visual information becomes more beneficial as audio quality decreases.
The benchmark supports realistic AVSR evaluation under diverse conditions.
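To make a comparison such as "harder than LRS3" concrete, the sketch below computes word error rate (WER), assuming that is the metric being compared; this summary does not name the metric, and the function name `word_error_rate` is hypothetical rather than taken from the paper's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only, not data from the benchmark.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A higher WER on one benchmark than on another, for the same model, is the usual sense in which one benchmark is called harder.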
Abstract
We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual…
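As an illustration of the additive-noise distortion mentioned in the abstract, here is a minimal sketch of mixing noise into a speech signal at a chosen signal-to-noise ratio. It is not the authors' pipeline: the function name `add_noise_at_snr`, the synthetic signals, and the 5 dB value are all assumptions made for the example.

```python
import numpy as np


def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return speech plus noise, with the noise scaled to the requested SNR in dB."""
    # Repeat or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        raise ValueError("noise signal has zero power")

    # Choose scale so that speech_power / (scale**2 * noise_power) == 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Synthetic stand-ins: a 220 Hz tone as "speech" and white noise (assumed 16 kHz rate).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
noisy = add_noise_at_snr(speech, noise, snr_db=5.0)
```

Lower `snr_db` values give a more degraded signal, which is the regime where the abstract reports that visual information contributes most.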
