UAVM: Towards Unifying Audio and Visual Models
Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

TL;DR
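This paper introduces UAVM, a Unified Audio-Visual Model that replaces independent audio and video branches with a unified one. It reaches a state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound and exhibits properties that modality-independent models lack.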
Contribution
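The paper presents a model that unifies the audio and visual branches, which conventional audio-visual models keep independent. The unified design improves audio-visual event classification accuracy and shows properties not found in modality-independent counterparts.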
Findings
Achieves 65.8% audio-visual event classification accuracy on VGGSound
Exhibits several properties that its modality-independent counterparts do not have
Sets a new state of the art in audio-visual event classification on VGGSound
Abstract
Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.
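The abstract contrasts independent audio and video branches with a unified one. As a rough illustration of that idea only, the sketch below routes each modality through its own input projection and then through a single set of shared weights. This page does not describe the actual UAVM architecture, so every name, dimension, and layer choice here (`UnifiedClassifierSketch`, the ReLU, the linear shared head) is a hypothetical assumption, not the authors' implementation.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class UnifiedClassifierSketch:
    """Hypothetical toy model: per-modality projections, one shared head."""

    def __init__(self, audio_dim, video_dim, shared_dim, num_classes, seed=0):
        rng = random.Random(seed)
        # Modality-specific projections map each input into a common space.
        self.proj = {
            "audio": make_matrix(shared_dim, audio_dim, rng),
            "video": make_matrix(shared_dim, video_dim, rng),
        }
        # Shared weights: the same parameters are used for both modalities.
        self.shared = make_matrix(num_classes, shared_dim, rng)

    def logits(self, features, modality):
        hidden = matvec(self.proj[modality], features)
        hidden = [max(0.0, h) for h in hidden]  # ReLU nonlinearity
        return matvec(self.shared, hidden)


model = UnifiedClassifierSketch(audio_dim=4, video_dim=6, shared_dim=3, num_classes=2)
audio_logits = model.logits([0.1] * 4, "audio")
video_logits = model.logits([0.2] * 6, "video")
```

In this toy setup only `self.proj` differs between modalities; a model with fully independent branches would instead keep a separate copy of `self.shared` per modality.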
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
