Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval
Zichen Yuan, Qi Shen, Bingyi Zheng, Yuting Liu, Linying Jiang, Guibing Guo

TL;DR
This paper introduces a novel cross-modal retrieval framework that fuses video and audio data using a cross-modal mixer and masked autoencoder, significantly improving retrieval accuracy and demonstrating versatility across tasks.
Contribution
The proposed framework uniquely combines a cross-modal mixer with masked autoencoder pre-training to enhance semantic alignment between video and audio modalities.
Findings
Outperforms previous state-of-the-art in video-audio retrieval by up to 2 times
Effective fusion of modalities reduces redundancy and improves semantic understanding
Model transfers well to other downstream tasks as a universal cross-modal model
Abstract
Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, the information from each modality exhibits distinct representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval which consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt a cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to achieve a fuse-then-separate task in the pre-training phase. We feed masked fused representations into…
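The fuse-then-mask step described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the excerpt does not specify how the mixer combines the modalities, so the element-wise weighted sum, the function names `mix_patches` and `mask_patches`, and the parameters `alpha` and `mask_ratio` are all assumptions made for illustration. Each modality is represented as a list of equally sized patch vectors, in the spirit of treating both video and audio as images.

```python
import random

def mix_patches(video, audio, alpha=0.5):
    """Fuse two patch sequences by element-wise weighted sum.

    Assumption: a weighted sum stands in for the paper's mixer,
    whose exact form is not given in the abstract.
    """
    if len(video) != len(audio):
        raise ValueError("video and audio must have the same number of patches")
    return [
        [alpha * v + (1.0 - alpha) * a for v, a in zip(vp, ap)]
        for vp, ap in zip(video, audio)
    ]

def mask_patches(patches, mask_ratio=0.75, seed=0):
    """Keep a random subset of patches; return (visible patches, kept indices)."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(patches) * (1.0 - mask_ratio)))
    kept = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in kept], kept

# Toy data: 8 patches of 4 values each per modality (hypothetical shapes).
video = [[float(i)] * 4 for i in range(8)]
audio = [[float(-i)] * 4 for i in range(8)]

fused = mix_patches(video, audio, alpha=0.5)          # fuse the two modalities
visible, kept = mask_patches(fused, mask_ratio=0.75)  # hide most fused patches
targets = (video, audio)  # what a fuse-then-separate decoder would reconstruct
```

In a real pre-training setup, an encoder would consume only `visible`, and a decoder would be trained to recover both entries of `targets`; those learned components are omitted here.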
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
