Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Apoorv Vyas; Heng-Jui Chang; Cheng-Fu Yang; Po-Yao Huang; Luya Gao; Julius Richter; Sanyuan Chen; Matt Le; Piotr Dollár; Christoph Feichtenhofer; Ann Lee; Wei-Ning Hsu

arXiv:2512.19687·cs.SD·December 23, 2025


TL;DR

This paper presents PE-AV, a large-scale multimodal encoder trained with contrastive learning to improve audiovisual perception, enabling new cross-modal tasks and achieving state-of-the-art results across benchmarks.

Contribution

Introduces PE-AV, a unified audiovisual encoder that produces cross-modal embeddings and is trained at large scale on synthesized captions for diverse audio-video data.

Findings

01 Achieved state-of-the-art performance on standard audio and video benchmarks.

02 Enabled novel tasks like speech retrieval through unified embeddings.

03 Improved zero-shot performance by scaling contrastive objectives.
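To make the second finding concrete, the sketch below shows how retrieval works once every modality lives in one embedding space: a query from one modality is ranked against candidates from another by cosine similarity. This is only an illustration, not PE-AV's code. The function names and the toy three-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        similarity = sum(a * b for a, b in zip(q, c))
        scored.append((similarity, idx))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored]


# Hypothetical embeddings: one text query and three audio clips.
text_query = [1.0, 0.0, 0.0]
audio_clips = [
    [0.0, 1.0, 0.0],   # unrelated clip
    [0.9, 0.1, 0.0],   # clip close to the query
    [-1.0, 0.0, 0.0],  # clip pointing the opposite way
]
ranking = cosine_rank(text_query, audio_clips)
print(ranking)
```

Because the ranking only needs embeddings, the same function would serve text-to-audio, audio-to-video, or any other direction, provided the encoders were trained to share one space.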

Abstract

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and…
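As a rough illustration of what a sum of pairwise contrastive objectives can look like, the sketch below applies a symmetric InfoNCE loss to each modality pair in a caller-supplied list and adds the results. It is a minimal sketch under stated assumptions, not the paper's implementation: the excerpt above does not say which ten pairs are used, so the pair list, the temperature of 0.07, and all function names here are assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(queries, keys, temperature):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def symmetric_pair_loss(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE losses (temperature is assumed)."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def total_contrastive_loss(batch, pairs, temperature=0.07):
    """Sum the symmetric loss over every (modality, modality) pair given."""
    return sum(symmetric_pair_loss(batch[x], batch[y], temperature) for x, y in pairs)


# Hypothetical batch of two items with toy 2-D embeddings per modality.
pairs = [("audio", "video"), ("audio", "text"), ("video", "text")]
aligned = {
    "audio": [[1.0, 0.0], [0.0, 1.0]],
    "video": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch with the text rows swapped, so text no longer matches its item.
shuffled = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])

aligned_loss = total_contrastive_loss(aligned, pairs)
shuffled_loss = total_contrastive_loss(shuffled, pairs)
print(aligned_loss < shuffled_loss)
```

Extending the `pairs` list is all it takes to add objectives in this sketch, which is one way to read the abstract's point about scaling the number of cross-modal pairs.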

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications