LRS-VoxMM: A benchmark for in-the-wild audio-visual speech recognition

Doyeop Kwak; Jeongsoo Choi; Suyeon Lee; Joon Son Chung

arXiv:2604.27866·eess.AS·May 1, 2026

TL;DR

LRS-VoxMM is an in-the-wild benchmark for audio-visual speech recognition (AVSR) that covers a more diverse range of real-world scenarios and acoustic conditions than commonly used benchmarks.

Contribution

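The paper introduces a challenging AVSR benchmark derived from VoxMM, preprocessed in an LRS-style format, together with distorted evaluation sets (additive noise, reverberation, and bandwidth limitation) for testing robustness under severe acoustic degradation.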

Findings

01

LRS-VoxMM is considerably harder than LRS3.

02

Visual information becomes more beneficial as audio quality decreases.

03

The benchmark supports realistic AVSR evaluation under diverse conditions.

Abstract

We introduce LRS-VoxMM, an in-the-wild benchmark for audio-visual speech recognition (AVSR). The benchmark is derived from VoxMM, a dataset of diverse real-world spoken conversations with human-annotated transcriptions. We select AVSR-suitable samples and preprocess them in an LRS-style format for direct use in existing AVSR pipelines. Compared with commonly used benchmarks, LRS-VoxMM covers a more diverse range of scenarios and acoustic conditions. We also release distorted evaluation sets with additive noise, reverberation, and bandwidth limitation to support evaluation under severe acoustic degradation. Experimental results show that LRS-VoxMM is considerably harder than LRS3 and that the contribution of visual information becomes more evident as the audio signal degrades. LRS-VoxMM supports more realistic AVSR benchmarking and encourages further research on the role of visual…
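The three degradation types named in the abstract (additive noise, reverberation, and bandwidth limitation) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function names, the SNR-based mixing, the use of a room impulse response, and the ideal FFT low-pass filter are all assumptions chosen for illustration, and the listing does not state which noise sources, impulse responses, or cutoff frequencies the released sets use.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio (hypothetical helper)."""
    # Repeat or trim the noise so it matches the speech length.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    return np.convolve(speech, rir)[: len(speech)]


def limit_bandwidth(speech, sample_rate, cutoff_hz):
    """Remove all frequency content above cutoff_hz with an ideal FFT low-pass."""
    spectrum = np.fft.rfft(speech)
    freqs = np.fft.rfftfreq(len(speech), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(speech))
```

Each function takes a one-dimensional floating-point waveform and returns a waveform of the same length, so the distortions can be chained, for example reverberation followed by additive noise. A practical pipeline would more likely use a smoother filter than the ideal low-pass shown here, which is kept only because it is short and easy to verify.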

Code & Models

Repositories

kaistmm/VoxMM (GitHub)
