Learning Joint Embedding for Cross-Modal Retrieval

Donghuo Zeng

arXiv:1908.07673·cs.IR·August 22, 2019

TL;DR

This paper introduces a deep learning architecture that uses triplet neural networks to better align heterogeneous data modalities for cross-modal retrieval, with particular attention to gaps in temporal structure between modalities.

Contribution

It proposes the S-DCCA architecture combined with triplet neural networks to enhance correlation learning for cross-modal retrieval tasks.
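
As background for the correlation objective, the sketch below computes the total canonical correlation between two views with plain linear CCA in NumPy. It is not the author's S-DCCA implementation: the function name `total_correlation`, the regularizer `reg`, and the toy data are assumptions made for illustration. Deep CCA variants typically maximize this same quantity on the outputs of two networks rather than on raw features.

```python
import numpy as np


def total_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between views x (n, dx) and y (n, dy).

    Hypothetical helper: plain linear CCA with a small ridge term `reg`
    added to each covariance matrix for numerical stability.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)

    # Regularized within-view covariances and the cross-view covariance.
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)

    def inv_sqrt(m):
        # Inverse square root of a symmetric positive-definite matrix.
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations; their sum is the total correlation.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return float(np.linalg.svd(t, compute_uv=False).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(500, 3))           # toy "audio" features
    mixing = np.array([[2.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0],
                       [0.0, 0.0, 3.0]])
    video = audio @ mixing                      # toy "video" view, linearly related
    print(total_correlation(audio, video))      # close to the view dimension, 3
```

When one view is an invertible linear map of the other, every canonical correlation is close to 1, so the total approaches the feature dimension; for independent views it stays near 0.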

Findings

01. TNN-based architecture achieves superior retrieval performance.

02. Supervised learning of data representations improves correlation accuracy.

03. The method effectively addresses temporal structure gaps in multimodal data.

Abstract

Cross-modal retrieval uses a query in one modality to obtain relevant data in another modality. The main challenge lies in bridging the heterogeneity gap so that similarity can be computed across modalities, a problem that has been broadly discussed in image-text, audio-text, and video-text cross-modal multimedia data mining and retrieval. However, the gap between the temporal structures of different data modalities is not well addressed, because alignment between temporal cross-modal structures is lacking. Our research focuses on learning the correlation between different modalities for the task of cross-modal retrieval. We have proposed an architecture, Supervised-Deep Canonical Correlation Analysis (S-DCCA), for cross-modal retrieval. In this forum paper, we discuss how to exploit triplet neural networks (TNN) to enhance the correlation learning for cross-modal…
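
To make the triplet idea concrete, here is a minimal sketch of a standard triplet margin loss over embeddings from two modalities. It is a generic formulation, not necessarily the loss used in the paper; the function name `triplet_margin_loss`, the Euclidean distance, and the margin value are assumptions chosen for illustration.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings.

    Hypothetical helper: each argument has shape (batch, dim). The anchor
    comes from one modality; the positive and negative come from the other.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # distance to matching item
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # distance to non-matching item
    # Penalize triplets where the negative is not at least `margin`
    # farther from the anchor than the positive.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


if __name__ == "__main__":
    audio_anchor = np.array([[0.0, 0.0]])
    video_match = np.array([[0.0, 0.0]])
    video_other = np.array([[3.0, 4.0]])
    # Well-separated triplet: no penalty.
    print(triplet_margin_loss(audio_anchor, video_match, video_other))
    # Roles swapped: the "positive" is far away, so the loss is large.
    print(triplet_margin_loss(audio_anchor, video_other, video_match))
```

The loss is zero once the matching item is closer to the anchor than the non-matching item by at least the margin, which pushes paired items from different modalities together in the shared embedding space.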

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Image Retrieval and Classification Techniques