Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition
Seungwhan Moon, Suyoun Kim, Haohan Wang

TL;DR
This paper introduces a transfer deep learning framework that enables knowledge transfer between different modalities, such as from speech to video recognition, using analogy-preserving embeddings and fine-tuning.
Contribution
The novel framework allows transfer of semantic knowledge across modalities by learning analogy-preserving embeddings and fine-tuning networks without changing their topology.
Findings
The authors demonstrate knowledge transfer between modalities on audio-visual datasets.
The framework is presented as applicable to other multimodal datasets and to existing networks.
The approach may extend to multimodal recognition tasks beyond the audio-visual case.
Abstract
We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune the network trained for video recognition, given an initial set of audio-video parallel dataset within the same semantics. Our approach first learns the analogy-preserving embeddings between the abstract representations learned from intermediate layers of each network, allowing for semantics-level transfer between the source and target modalities. We then apply our neural network operation that fine-tunes the target network with the additional knowledge transferred from the source network, while keeping the topology of the target network unchanged. While we present an audio-visual recognition task as an application of our approach, our framework is…
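The two steps named in the abstract (learn an embedding between intermediate representations, then fine-tune the target network without changing its topology) can be illustrated with a minimal sketch. It is not the authors' method: the analogy-preserving objective is not specified in this summary, so plain ridge regression stands in for it, and the fine-tuning step is reduced to softmax-regression updates on a single top layer. All names (`fit_embedding`, `finetune_top_layer`), dimensions, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_embedding(src_feats, tgt_feats, reg=1e-3):
    """Ridge-regression map W with src_feats @ W ~= tgt_feats.

    Stand-in (assumption) for the paper's analogy-preserving embedding.
    """
    d = src_feats.shape[1]
    gram = src_feats.T @ src_feats + reg * np.eye(d)
    return np.linalg.solve(gram, src_feats.T @ tgt_feats)

def finetune_top_layer(weights, feats, labels, lr=0.1, epochs=200):
    """Gradient steps on a softmax top layer; the weight shape never changes."""
    w = weights.copy()
    onehot = np.eye(w.shape[1])[labels]
    for _ in range(epochs):
        logits = feats @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w -= lr * feats.T @ (probs - onehot) / len(feats)
    return w

rng = np.random.default_rng(0)

# Synthetic "parallel" intermediate-layer activations (hypothetical sizes):
# 8-dim audio features and 6-dim video features for the same 50 items.
true_map = rng.normal(size=(8, 6))
audio_parallel = rng.normal(size=(50, 8))
video_parallel = audio_parallel @ true_map + 0.01 * rng.normal(size=(50, 6))

# Step 1: learn the cross-modal embedding from the parallel set.
W = fit_embedding(audio_parallel, video_parallel)

# Step 2: map extra audio-only data into the video feature space and use it
# to update the video network's top layer (3 classes, topology unchanged).
extra_audio = rng.normal(size=(30, 8))
extra_labels = rng.integers(0, 3, size=30)
transferred = extra_audio @ W
top_layer = np.zeros((6, 3))
new_top_layer = finetune_top_layer(top_layer, transferred, extra_labels)
```

In this toy setting the learned map recovers the synthetic relation between the two feature spaces, and the updated top layer keeps the same shape as the original, which mirrors the abstract's constraint that the target topology stays fixed.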
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
