UAVM: Towards Unifying Audio and Visual Models

Yuan Gong; Alexander H. Liu; Andrew Rouditchenko; James Glass

arXiv:2208.00061 · cs.CV · February 17, 2023 · 1 citation

TL;DR

This paper introduces UAVM, a unified audio-visual model that achieves a state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound and exhibits properties that its modality-independent counterparts lack.

Contribution

The paper presents a Unified Audio-Visual Model (UAVM) that unifies the audio and visual branches that conventional models keep independent, improving classification accuracy and revealing properties that modality-independent models do not have.

Findings

01

Achieves 65.8% accuracy on VGGSound

02

Finds properties of UAVM that its modality-independent counterparts do not have

03

Sets a new state of the art in audio-visual event classification

Abstract

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Code & Models

Repositories

YuanGongND/uavm
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis