LRS3-TED: a large-scale dataset for visual speech recognition

Triantafyllos Afouras; Joon Son Chung; Andrew Zisserman

arXiv:1809.00496 · cs.CV · October 30, 2018 · 281 cites

TL;DR

This paper presents LRS3-TED, a large-scale multi-modal dataset for visual and audio-visual speech recognition. It provides face tracks from over 400 hours of TED and TEDx videos, together with the corresponding subtitles and word alignment boundaries.

Contribution

The paper introduces LRS3-TED, a dataset for visual and audio-visual speech recognition that is substantially larger than the other public datasets available for general research.

Findings

01

LRS3-TED contains face tracks from over 400 hours of TED and TEDx videos.

02

The dataset includes face tracks, subtitles, and word boundary annotations.

03

Its scale is intended to support training and evaluation of visual and audio-visual speech recognition models.

Abstract

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

Code & Models

Repositories

jaejunl/hyface
pytorch

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Subtitles and Audiovisual Media