Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval
Jiwei Zhang, Yi Yu, Suhua Tang, Jianming Wu, Wei Li

TL;DR
This paper introduces a novel variational autoencoder architecture that leverages CCA to learn joint audio-visual embeddings, significantly improving cross-modal retrieval performance by effectively capturing correlations and reducing discrepancies.
Contribution
The paper proposes a new VAE-based model with CCA-based latent spaces for audio-visual correlation learning, enhancing feature extraction and retrieval accuracy over existing methods.
Findings
Outperforms existing cross-modal retrieval methods on benchmark datasets.
Effectively learns audio-visual correlations and reduces intra- and inter-modal discrepancies.
Demonstrates robustness to noise and missing data in multi-modal information.
Abstract
Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and has become a popular topic in information retrieval, machine learning, and databases. The major challenge of cross-modal retrieval is how to effectively measure the similarity between data from different modalities. Although several research works have calculated the correlation between data from different modalities by learning a common subspace representation, the encoder's ability to extract features from multi-modal information is not satisfactory. In this paper, we present a novel variational autoencoder (VAE) architecture for audio-visual cross-modal retrieval, which learns paired audio-visual correlation embedding and category correlation embedding as constraints to reinforce the mutuality of audio-visual information. On the one hand, audio encoder and visual encoder separately encode…
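To make the "common subspace" idea in the abstract concrete, the following is a minimal sketch of the classical linear CCA baseline followed by cosine-similarity retrieval. It is not the paper's VAE architecture. The function names (`fit_linear_cca`, `retrieve`), the regularization value, the feature dimensions, and the synthetic "audio" and "visual" features are all assumptions made for illustration only.

```python
import numpy as np

def fit_linear_cca(X, Y, dim, reg=1e-3):
    """Fit regularized linear CCA on paired views X (n, dx) and Y (n, dy)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :dim]
    Wy = Ky @ Vt.T[:, :dim]
    return mx, my, Wx, Wy, s[:dim]

def retrieve(query_emb, gallery_emb):
    """Return gallery indices ranked by cosine similarity, best match first."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)

# Synthetic paired features (assumption): both views are noisy linear
# functions of one shared latent vector per item.
rng = np.random.default_rng(0)
n, latent_dim = 200, 4
z = rng.normal(size=(n, latent_dim))
audio = z @ rng.normal(size=(latent_dim, 16)) + 0.05 * rng.normal(size=(n, 16))
visual = z @ rng.normal(size=(latent_dim, 32)) + 0.05 * rng.normal(size=(n, 32))

mx, my, Wx, Wy, corrs = fit_linear_cca(audio, visual, dim=latent_dim)
audio_emb = (audio - mx) @ Wx
visual_emb = (visual - my) @ Wy

# Audio-to-visual retrieval: item i in the audio view should retrieve item i.
ranking = retrieve(audio_emb, visual_emb)
top1_accuracy = float(np.mean(ranking[:, 0] == np.arange(n)))
print("canonical correlations:", np.round(corrs, 3))
print("top-1 accuracy on the fitting set:", top1_accuracy)
```

The accuracy here is measured on the same pairs used to fit the projections, so it only shows that the two views land in a shared space; a real evaluation would use held-out queries and a ranking metric such as mean average precision.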
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Image and Video Retrieval Techniques
