SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

Young Jin Ahn; Jungwoo Park; Sangha Park; Jonghyun Choi; Kee-Eung Kim

arXiv:2406.12233 · cs.AI · June 19, 2024 · 1 citation

TL;DR

SyncVSR is an end-to-end visual speech recognition framework that uses crossmodal audio synchronization to improve accuracy, reduce data requirements, and handle homophenes effectively across languages and modalities.

Contribution

It introduces a novel audio-visual synchronization method using quantized audio tokens for frame-level supervision in VSR.

Findings

01 Achieves state-of-the-art VSR performance.

02 Reduces data usage by up to ninefold.

03 Demonstrates versatility across tasks, languages, and modalities.

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes: visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations…
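
The frame-level supervision described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each video frame is paired with a fixed number of quantized audio tokens (`tokens_per_frame`, a hypothetical parameter), that a single linear projection maps each frame feature to logits for all of those tokens at once, and that the synchronization objective is a mean cross-entropy over the token targets. All names, shapes, and sizes below are illustrative, and the random inputs stand in for real encoder outputs and audio token ids.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def project(frame, weight):
    """Linear projection: frame (D,) times weight (D x N) -> logits (N,)."""
    n_out = len(weight[0])
    return [sum(f * row[j] for f, row in zip(frame, weight)) for j in range(n_out)]

def audio_sync_loss(frames, weight, audio_tokens, tokens_per_frame, vocab_size):
    """Mean cross-entropy between predicted and target audio tokens.

    frames:       T video-frame features, each of length D (encoder output)
    weight:       D x (tokens_per_frame * vocab_size) projection matrix
    audio_tokens: T * tokens_per_frame target token ids (assumed to come
                  from an audio quantizer; random placeholders here)
    """
    assert len(audio_tokens) == len(frames) * tokens_per_frame
    total = 0.0
    for t, frame in enumerate(frames):
        logits = project(frame, weight)
        for k in range(tokens_per_frame):
            # All tokens of a frame are scored from one projection output,
            # with no dependence on previously predicted tokens
            # (non-autoregressive).
            slot = logits[k * vocab_size:(k + 1) * vocab_size]
            target = audio_tokens[t * tokens_per_frame + k]
            total -= log_softmax(slot)[target]
    return total / len(audio_tokens)

random.seed(0)
T, D, K, V = 5, 8, 4, 16  # hypothetical sizes: frames, feature dim, tokens per frame, vocabulary
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
weight = [[random.gauss(0, 0.1) for _ in range(K * V)] for _ in range(D)]
audio_tokens = [random.randrange(V) for _ in range(T * K)]
loss = audio_sync_loss(frames, weight, audio_tokens, K, V)
```

In a setup like this, the auxiliary loss would presumably be added to the main recognition loss with some weighting, and the projection could be dropped at inference time; both points are assumptions rather than details stated in the abstract.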

Code & Models

Repositories

KAIST-AILab/SyncVSR (JAX, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis