Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Yuxin Mao; Jing Zhang; Mochu Xiang; Yiran Zhong; Yuchao Dai

arXiv:2310.08303 · cs.CV · October 13, 2023 · 2 cites

TL;DR

This paper introduces ECMVAE, a multimodal variational auto-encoder for audio-visual segmentation (AVS) that explicitly models the shared and modality-specific features of audio and visual data, with a reported 3.84 mIOU improvement on AVSBench.

Contribution

The paper presents ECMVAE, an explicit representation learning framework that factorizes each modality's representation into shared and modality-specific parts and strengthens the cross-modal shared information to improve AVS performance.
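
To make the idea of a factorized latent representation concrete, here is a minimal sketch in plain Python. It is not the paper's architecture or objective: it only illustrates the standard VAE ingredients (diagonal-Gaussian statistics, the reparameterization trick, a KL term against a standard normal prior) applied to a latent code split into a shared block and a modality-specific block. The function names, the `shared_dim` parameter, and the convention that the first `shared_dim` entries form the shared block are all assumptions made for illustration.

```python
import math
import random


def split_latent(stats, shared_dim):
    """Split (mean, logvar) into a shared block and a modality-specific block.

    Hypothetical convention: the first `shared_dim` entries are "shared".
    """
    mean, logvar = stats
    shared = (mean[:shared_dim], logvar[:shared_dim])
    specific = (mean[shared_dim:], logvar[shared_dim:])
    return shared, specific


def reparameterize(mean, logvar, rng):
    """Sample z = mean + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


def gaussian_kl(mean, logvar):
    """KL(N(mean, exp(logvar)) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, logvar))


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up encoder outputs for one audio clip and one video frame.
    audio_stats = ([0.2, -0.1, 0.5, 0.0], [0.0, -0.5, 0.1, 0.0])
    visual_stats = ([0.1, 0.0, -0.3, 0.4], [-0.2, 0.0, 0.0, 0.3])
    for name, stats in (("audio", audio_stats), ("visual", visual_stats)):
        shared, specific = split_latent(stats, shared_dim=2)
        z_shared = reparameterize(*shared, rng)
        z_specific = reparameterize(*specific, rng)
        print(name, len(z_shared), len(z_specific), gaussian_kl(*shared))
```

Keeping the shared and specific blocks as separate Gaussians means each can be regularized or compared across modalities on its own; how ECMVAE actually constrains them is not described on this page.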

Findings

01. Achieved state-of-the-art results on AVSBench with a 3.84 mIOU improvement.

02. Effectively models shared and specific features of audio and visual data.

03. Demonstrated superior segmentation accuracy on the MS3 subset.
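
For readers unfamiliar with the metric behind the 3.84 figure, the sketch below shows how mean intersection-over-union (mIOU) is commonly computed for binary segmentation masks. It assumes masks are equally sized 2D lists of 0/1 values and that scores are averaged over frames; the benchmark's exact evaluation protocol is not described on this page, and the function names are hypothetical.

```python
def binary_iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds, targets):
    """Average the per-frame IoU over paired predictions and ground truths."""
    scores = [binary_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)
```

For example, a prediction that covers two pixels when only one of them belongs to the sounding object has an IoU of 1/2 on that frame. Reported gains such as 3.84 are typically stated on a 0 to 100 scale, that is, in percentage points of this average.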

Abstract

We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To…

Code & Models

Repositories

opennlplab/mmvae-avs (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Digital Media Forensic Detection

Methods: Focus