LRS3-TED: a large-scale dataset for visual speech recognition
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

TL;DR
This paper presents LRS3-TED, a large-scale multi-modal dataset for visual and audio-visual speech recognition. It provides face tracks from over 400 hours of TED and TEDx videos, together with the corresponding subtitles and word alignment boundaries.
Contribution
The paper introduces LRS3-TED, a dataset for visual and audio-visual speech recognition that is substantially larger in scale than other public datasets available for general research.
Findings
LRS3-TED contains face tracks from over 400 hours of TED and TEDx videos.
The face tracks are accompanied by the corresponding subtitles and word alignment boundaries.
The dataset is intended to support training and evaluation of visual and audio-visual speech recognition models.
Abstract
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Subtitles and Audiovisual Media
