Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Zichen Yuan; Qi Shen; Bingyi Zheng; Yuting Liu; Linying Jiang; Guibing Guo

arXiv:2308.13820·cs.IR·August 29, 2023

Open Access

TL;DR

This paper introduces a novel cross-modal retrieval framework that fuses video and audio data using a cross-modal mixer and masked autoencoder, significantly improving retrieval accuracy and demonstrating versatility across tasks.

Contribution

The proposed framework uniquely combines a cross-modal mixer with masked autoencoder pre-training to enhance semantic alignment between video and audio modalities.

Findings

1. Outperforms previous state-of-the-art in video-audio retrieval by up to 2 times

2. Effective fusion of modalities reduces redundancy and improves semantic understanding

3. Model transfers well to other downstream tasks as a universal cross-modal model

Abstract

Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, each modality carries its own representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to achieve a fuse-then-separate task in the pre-training phase. We feed masked fused representations into…
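The fuse, mask, and separate steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the mixing operation, so the sketch assumes the simplest possible mixer (tagging each token with its modality and concatenating), and it assumes MAE-style random masking with a 0.75 ratio. The function names `mix_tokens`, `random_mask`, and `separate` are hypothetical, and the encoder and decoder themselves are omitted.

```python
import random

def mix_tokens(video_tokens, audio_tokens):
    """Hypothetical mixer (assumption): tag tokens by modality, then concatenate."""
    fused = [("video", t) for t in video_tokens]
    fused += [("audio", t) for t in audio_tokens]
    return fused

def random_mask(fused, mask_ratio, seed=0):
    """MAE-style random masking (assumption): hide a random subset of fused tokens.

    Returns the visible tokens and the sorted indices of the masked positions.
    """
    rng = random.Random(seed)
    n_masked = int(len(fused) * mask_ratio)
    masked_idx = sorted(rng.sample(range(len(fused)), n_masked))
    masked_set = set(masked_idx)
    visible = [tok for i, tok in enumerate(fused) if i not in masked_set]
    return visible, masked_idx

def separate(fused):
    """Split a fused sequence back into per-modality lists.

    In a fuse-then-separate objective, these would be the reconstruction targets.
    """
    video = [t for modality, t in fused if modality == "video"]
    audio = [t for modality, t in fused if modality == "audio"]
    return video, audio

# Toy 2-dimensional "patch" tokens standing in for real video and audio features.
video = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
audio = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]]

fused = mix_tokens(video, audio)
visible, masked_idx = random_mask(fused, mask_ratio=0.75)
targets = separate(fused)
```

In a real pre-training loop, an encoder would see only `visible`, and a decoder would be trained to reconstruct the per-modality `targets` at the masked positions; both components are left out here because the truncated abstract does not describe them.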

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications