CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo

TL;DR
CoGenAV is a versatile, data-efficient audio-visual model trained with contrastive and generative objectives, achieving state-of-the-art results across multiple speech and audio-visual tasks, especially in noisy conditions.
Contribution
The paper introduces CoGenAV, a novel contrastive-generative synchronization framework for learning versatile audio-visual representations from limited labeled data.
Findings
Achieves a word error rate (WER) of 1.27 for audio-visual speech recognition (AVSR) on LRS2
Attains a WER of 20.5 for visual speech recognition (VSR) on LRS2
Improves performance on noisy speech by over 70%
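For readers unfamiliar with the metric behind the first two findings, the sketch below shows the usual way WER is computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring script and text normalization are not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which may differ from the scoring used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that drops one word from a six-word reference scores 1/6. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.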
Abstract
The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony: contrastive feature alignment and generative text prediction. Training uses only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized…
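To make the dual objective named in the abstract concrete, here is a minimal sketch of how a contrastive alignment term and a generative text-prediction term might be combined. The abstract only names the two objectives, so everything specific here is an assumption: the symmetric InfoNCE form of the contrastive term, the `temperature` value, the `weight` that balances the terms, and all function names are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def _log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a single list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def _normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def contrastive_loss(audio, visual, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE (an assumed form): audio[i] and visual[i] are the
    positive pair; every other pairing in the batch is a negative."""
    a = [_normalize(v) for v in audio]
    v = [_normalize(x) for x in visual]
    n = len(a)
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    audio_to_visual = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    visual_to_audio = -sum(
        _log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (audio_to_visual + visual_to_audio)


def generative_loss(token_logits, target_ids) -> float:
    """Mean cross-entropy of the target text tokens under predicted logits."""
    total = sum(_log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def dual_objective(audio, visual, token_logits, target_ids, weight: float = 1.0) -> float:
    """Contrastive term plus a weighted generative term (weight is hypothetical)."""
    return contrastive_loss(audio, visual) + weight * generative_loss(token_logits, target_ids)
```

In this sketch, synchronized audio and visual embeddings drive the contrastive term toward zero, while the generative term falls as the predicted token distribution concentrates on the correct transcript.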
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
