Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals
Shah Nawaz, Muhammad Kamran Janjua, Ignazio Gallo, Arif Mahmood, Alessandro Calefati

TL;DR
This paper introduces a deep learning method that creates a shared latent space for audio and visual data, enabling improved cross-modal verification, matching, and retrieval without pairwise supervision.
Contribution
The paper presents a novel single stream network with a new loss function for joint audio-visual representation learning, eliminating the need for pairwise or triplet supervision.
Findings
Achieves state-of-the-art results on cross-modal verification and matching.
Demonstrates effectiveness for cross-modal biometric applications.
Achieves results comparable to prior work on the remaining task, cross-modal retrieval.
Abstract
We propose a novel deep training algorithm for joint representation of audio and visual information which consists of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging class centers, which helps to eliminate the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchmark audio-visual dataset, on a multitude of tasks including cross-modal verification, cross-modal matching, and cross-modal retrieval. State-of-the-art performance is achieved on cross-modal verification and matching while comparable results are observed on the remaining applications. Our experiments demonstrate the effectiveness of the technique for cross-modal biometric…
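To make the class-center idea concrete, here is a minimal sketch of a generic center-based objective. It is not the paper's loss function, which this summary does not specify. It assumes that audio and face embeddings come out of the same network with the same dimensionality and that each sample carries only an identity label. The names `center_loss`, `update_centers`, and `cosine_score`, and the update rate `alpha`, are hypothetical and chosen for illustration.

```python
import numpy as np


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance from each embedding to its class center.

    The loss only looks at identity labels, never at modality, so no
    audio-face pairs or triplets have to be constructed.
    """
    diffs = embeddings - centers[labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))


def update_centers(embeddings, labels, centers, alpha=0.5):
    """Move each class center toward the mean of its embeddings in the batch.

    alpha is an assumed update rate: 0 keeps the centers fixed and 1 jumps
    straight to the batch mean.
    """
    new_centers = centers.copy()
    for c in np.unique(labels):
        batch_mean = embeddings[labels == c].mean(axis=0)
        new_centers[c] += alpha * (batch_mean - centers[c])
    return new_centers


def cosine_score(a, b):
    """Cosine similarity, usable as a cross-modal verification score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Toy batch: rows 0 and 1 stand in for a voice and a face embedding of
    # identity 0, row 2 for a sample of identity 1.
    embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
    labels = np.array([0, 0, 1])
    centers = np.zeros((2, 2))

    print("loss before update:", center_loss(embeddings, labels, centers))
    centers = update_centers(embeddings, labels, centers, alpha=1.0)
    print("loss after update: ", center_loss(embeddings, labels, centers))
    print("same-identity score:", cosine_score(embeddings[0], embeddings[1]))
```

Because both modalities of an identity are pulled toward one shared center, they end up close to each other without ever being compared directly. In practice a term like this would typically be combined with a classification loss and optimized by gradient descent over the network weights, which the sketch leaves out.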
