Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval

Jiwei Zhang; Yi Yu; Suhua Tang; Jianming Wu; Wei Li

arXiv:2112.02601·cs.IR·December 7, 2021

Open Access

TL;DR

This paper introduces a variational autoencoder (VAE) architecture that uses canonical correlation analysis (CCA) to learn joint audio-visual embeddings. The authors report improved cross-modal retrieval performance, attributed to capturing audio-visual correlations and reducing intra- and inter-modal discrepancies.

Contribution

The paper proposes a new VAE-based model with CCA-based latent spaces for audio-visual correlation learning, enhancing feature extraction and retrieval accuracy over existing methods.

Findings

01. Outperforms existing cross-modal retrieval methods on benchmark datasets.

02. Effectively learns audio-visual correlations and reduces intra- and inter-modal discrepancies.

03. Demonstrates robustness to noise and missing data in multi-modal information.

Abstract

Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and has become a popular topic in information retrieval, machine learning, and databases. The major challenge is how to effectively measure the similarity between data of different modalities. Although several research works have calculated the correlation between data of different modalities by learning a common subspace representation, the encoder's ability to extract features from multi-modal information remains unsatisfactory. In this paper, we present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval, which learns paired audio-visual correlation embedding and category correlation embedding as constraints to reinforce the mutuality of audio-visual information. On the one hand, audio encoder and visual encoder separately encode…
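
To make the idea of a common subspace concrete, the following is a minimal sketch of cross-modal retrieval with plain linear CCA in NumPy. It is not the paper's VAE model: the encoders, losses, and datasets are not reproduced here. The data is synthetic, and every name (`fit_cca`, `embed`, `retrieve`, the feature sizes, the regularizer `reg`) is an assumption made for illustration.

```python
import numpy as np


def inv_sqrt(cov):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def fit_cca(audio, visual, n_components, reg=1e-3):
    """Fit linear CCA; returns per-modality (mean, projection) and correlations."""
    mu_a, mu_v = audio.mean(axis=0), visual.mean(axis=0)
    a, v = audio - mu_a, visual - mu_v
    n = a.shape[0]
    # Regularized covariances keep the whitening step numerically stable.
    cov_aa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    cov_vv = v.T @ v / (n - 1) + reg * np.eye(v.shape[1])
    cov_av = a.T @ v / (n - 1)
    white_a, white_v = inv_sqrt(cov_aa), inv_sqrt(cov_vv)
    # Singular vectors of the whitened cross-covariance give the canonical directions.
    u, s, vt = np.linalg.svd(white_a @ cov_av @ white_v)
    proj_a = white_a @ u[:, :n_components]
    proj_v = white_v @ vt.T[:, :n_components]
    return (mu_a, proj_a), (mu_v, proj_v), s[:n_components]


def embed(features, params):
    """Project features into the shared space and L2-normalize each row."""
    mean, proj = params
    z = (features - mean) @ proj
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def retrieve(query_emb, gallery_emb, top_k=5):
    """Rank gallery items by cosine similarity to each query."""
    sims = query_emb @ gallery_emb.T
    return np.argsort(-sims, axis=1)[:, :top_k]


# Synthetic paired data (assumption): both modalities are noisy linear
# views of the same 4-dimensional latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 4))
audio = latent @ rng.normal(size=(4, 20)) + 0.01 * rng.normal(size=(400, 20))
visual = latent @ rng.normal(size=(4, 30)) + 0.01 * rng.normal(size=(400, 30))

# Fit on the first 300 pairs, then run audio-to-visual retrieval on the rest.
audio_params, visual_params, canon_corrs = fit_cca(audio[:300], visual[:300], 4)
query = embed(audio[300:], audio_params)
gallery = embed(visual[300:], visual_params)
ranked = retrieve(query, gallery, top_k=5)

# Query i is paired with gallery item i, so a top-1 hit means ranked[i, 0] == i.
top1_accuracy = float(np.mean(ranked[:, 0] == np.arange(len(query))))
print("canonical correlations:", canon_corrs)
print("top-1 accuracy:", top1_accuracy)
```

Swapping the roles of `query` and `gallery` gives visual-to-audio retrieval with the same projections. A learned model such as the one the abstract describes would replace the linear projections with trained encoders, but the ranking step by similarity in the shared space stays the same in spirit.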

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Image and Video Retrieval Techniques