Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

Seungwhan Moon; Suyoun Kim; Haohan Wang

arXiv:1412.3121·cs.NE·February 19, 2016·29 cites

TL;DR

This paper introduces a transfer deep learning framework that enables knowledge transfer between different modalities, such as from speech to video recognition, using analogy-preserving embeddings and fine-tuning.

Contribution

The novel framework allows transfer of semantic knowledge across modalities by learning analogy-preserving embeddings and fine-tuning networks without changing their topology.

Findings

01. Effective transfer of knowledge demonstrated on audio-visual datasets.

02. Framework is flexible for various multimodal datasets and existing networks.

03. Potential for broad applications in multimodal recognition tasks.

Abstract

We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune a network trained for video recognition, given an initial parallel audio-video dataset that shares the same semantics. Our approach first learns analogy-preserving embeddings between the abstract representations learned at intermediate layers of each network, allowing for semantics-level transfer between the source and target modalities. We then apply a neural network operation that fine-tunes the target network with the additional knowledge transferred from the source network, while keeping the topology of the target network unchanged. While we present an audio-visual recognition task as an application of our approach, our framework is…
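The two-step idea in the abstract (learn a mapping between intermediate representations from parallel data, then reuse source-only data in the target network's representation space) can be illustrated with a minimal sketch. This is not the authors' implementation: a plain least-squares linear map fitted by gradient descent stands in for the analogy-preserving embedding, and every name here (`fit_linear_map`, `apply_map`, `W_true`, the toy dimensions) is a hypothetical chosen for illustration.

```python
import random


def apply_map(W, s):
    """Project one source-modality feature vector into the target space."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]


def fit_linear_map(src, tgt, lr=0.3, epochs=200):
    """Fit W minimising the mean squared error ||W s - t||^2 over paired rows.

    Assumption: a linear least-squares map is used here only as a stand-in
    for the embedding described in the abstract.
    """
    d_src, d_tgt, n = len(src[0]), len(tgt[0]), len(src)
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(epochs):
        grad = [[0.0] * d_src for _ in range(d_tgt)]
        for s, t in zip(src, tgt):
            pred = apply_map(W, s)
            for i in range(d_tgt):
                err = pred[i] - t[i]
                for j in range(d_src):
                    grad[i][j] += 2.0 * err * s[j] / n
        for i in range(d_tgt):
            for j in range(d_src):
                W[i][j] -= lr * grad[i][j]
    return W


random.seed(0)

# Hypothetical ground-truth relation between the two hidden spaces (toy data).
W_true = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]

# Parallel set: source (e.g. audio) hidden features paired with target
# (e.g. video) hidden features for the same underlying examples.
src_parallel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
tgt_parallel = [apply_map(W_true, s) for s in src_parallel]

# Step 1: learn the cross-modal map from the parallel data.
W = fit_linear_map(src_parallel, tgt_parallel)

# Step 2: project source-only features into the target hidden space. In a
# full pipeline these projected features, with their labels, would serve as
# extra training data for the target network's upper layers.
src_only = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
transferred = [apply_map(W, s) for s in src_only]
```

Because the projected features live in the target network's existing hidden space, only the weights above that layer would need updating, which is one way to read the abstract's claim that the target topology stays unchanged.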

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis