Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu

TL;DR
This paper presents PE-AV, a large-scale multimodal encoder trained with contrastive learning to improve audiovisual perception, enabling new cross-modal tasks and achieving state-of-the-art results across benchmarks.
Contribution
Introduction of PE-AV, a unified audiovisual encoder supporting cross-modal embeddings and large-scale training with synthesized captions for diverse audio-visual data.
Findings
Achieved state-of-the-art performance on standard audio and video benchmarks.
Enabled novel tasks like speech retrieval through unified embeddings.
Improved zero-shot performance by scaling contrastive objectives.
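To make the retrieval finding concrete, here is a minimal sketch of how retrieval in a shared embedding space is commonly done: embed the query and the candidates, then rank candidates by cosine similarity. This is not PE-AV's API. The function names (`l2_normalize`, `cosine_rank`) and the three-dimensional toy vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scores = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        similarity = sum(a * b for a, b in zip(q, c))
        scores.append((similarity, idx))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scores]


# Toy stand-ins: a text-query embedding and three audio-clip embeddings.
text_query = [0.9, 0.1, 0.0]
audio_clips = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]
ranking = cosine_rank(text_query, audio_clips)
```

With a unified encoder, the same ranking routine would apply to any query and candidate modality pair, since all embeddings live in one space.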
Abstract
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and…
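The pairwise contrastive idea in the abstract can be illustrated with a small sketch. It assumes a standard symmetric InfoNCE loss averaged over a list of modality pairs. The abstract does not specify the exact loss form, the temperature, or which ten pairs are used, so the function names, the 0.07 temperature, and the three pairs below are illustrative assumptions, not the authors' configuration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match positives[i]."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def multi_pair_loss(embeddings, pairs, temperature=0.07):
    """Mean symmetric loss over the given modality pairs."""
    losses = [symmetric_contrastive(embeddings[a], embeddings[b], temperature)
              for a, b in pairs]
    return sum(losses) / len(losses)


# Illustrative pairs only; the paper's ten objectives are not listed here.
pairs = [("audio", "video"), ("audio", "text"), ("video", "text")]

aligned = {
    "audio": [[1.0, 0.0], [0.0, 1.0]],
    "video": [[0.9, 0.1], [0.1, 0.9]],
    "text": [[1.0, 0.1], [0.0, 1.0]],
}
# Swapping the video rows breaks the audio-video and video-text correspondence.
shuffled = dict(aligned, video=[[0.1, 0.9], [0.9, 0.1]])

aligned_loss = multi_pair_loss(aligned, pairs)
shuffled_loss = multi_pair_loss(shuffled, pairs)
```

On this toy batch the aligned embeddings give a lower loss than the shuffled ones, which is the behavior a correspondence objective is meant to reward.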
Code & Models
- 🤗 facebook/pe-av-large (model) · 2.2k dl · ♡ 54
- 🤗 facebook/pe-av-base (model) · 347 dl · ♡ 10
- 🤗 facebook/pe-av-small (model) · 553 dl · ♡ 20
- 🤗 facebook/pe-av-small-16-frame (model) · 45 dl · ♡ 5
- 🤗 facebook/pe-a-frame-small (model) · 92 dl · ♡ 7
- 🤗 facebook/pe-a-frame-large (model) · 4.6k dl · ♡ 14
- 🤗 facebook/pe-av-large-16-frame (model) · 975 dl · ♡ 7
- 🤗 facebook/pe-av-base-16-frame (model) · 27 dl · ♡ 3
- 🤗 facebook/pe-a-frame-base (model) · 97 dl · ♡ 6
- 🤗 PatrickStar1/pe-av-base-16-frame (model) · 4 dl
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
