CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

Detao Bai; Zhiheng Ma; Xihan Wei; Liefeng Bo

arXiv:2505.03186 · cs.SD · May 16, 2025

TL;DR

CoGenAV is a versatile, data-efficient audio-visual model trained with contrastive and generative objectives, achieving state-of-the-art results across multiple speech and audio-visual tasks, especially in noisy conditions.

Contribution

The paper introduces CoGenAV, a novel contrastive-generative synchronization framework for learning versatile audio-visual representations from limited labeled data.

Findings

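01. Achieves a word error rate (WER) of 1.27 in audio-visual speech recognition (AVSR) on LRS2

02. Attains a WER of 20.5 in visual speech recognition (VSR) on LRS2

03. Improves speech performance in noisy conditions by over 70%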
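
For readers unfamiliar with the metric, the sketch below shows how WER is conventionally computed: word-level edit distance divided by the number of reference words. It is a generic illustration, not the paper's evaluation code. The helper `relative_reduction` assumes the "over 70%" figure is a relative improvement over a baseline, which this summary does not specify, and the example sentences and numbers are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Hypothetical helper: fraction by which an error rate dropped."""
    return (baseline - improved) / baseline


# Made-up example: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {100 * wer:.2f}%")
```

WER is usually reported as a percentage, so a value such as 1.27 would correspond to roughly one word error per 79 reference words if the figures above follow that convention.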

Abstract

The inherent synchronization between a speaker's lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony: contrastive feature alignment and generative text prediction. Training uses only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized…
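
The dual objective described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation. It assumes a common formulation: a symmetric InfoNCE-style contrastive loss over paired audio and visual embeddings, plus a token-level cross-entropy loss for text prediction. The function names, the temperature of 0.07, and the weighting term `weight` are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_loss(audio, video, temperature=0.07):
    """Symmetric InfoNCE: the i-th audio clip should match the i-th video clip."""
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in video]
    n = len(a)
    sim = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    # Audio-to-video direction: each row is a classification over video clips.
    a2v = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Video-to-audio direction: each column is a classification over audio clips.
    v2a = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (a2v + v2a)


def generative_loss(token_logits, target_ids):
    """Mean cross-entropy of the predicted text tokens against the transcript."""
    total = sum(log_softmax(l)[t] for l, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def dual_objective(audio, video, token_logits, target_ids, weight=1.0):
    # `weight` balances the two terms; its value here is an assumption.
    return contrastive_loss(audio, video) + weight * generative_loss(
        token_logits, target_ids
    )


# Toy batch: two synchronized audio/video pairs and a three-token transcript.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[0.9, 0.1], [0.1, 0.9]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
target_ids = [0, 1, 2]
print(dual_objective(audio, video, token_logits, target_ids))
```

In this formulation the contrastive term rewards embeddings in which synchronized audio and video clips are closer to each other than to other clips in the batch, while the generative term ties the shared representation to the spoken words.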

Code & Models

Repositories

humanmllm/cogenav
PyTorch · Official

Models

detao/CoGenAV (Hugging Face model)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies