Cross-Modal Discrete Representation Learning

Alexander H. Liu; SouYoung Jin; Cheng-I Jeff Lai; Andrew Rouditchenko; Aude Oliva; James Glass

arXiv:2106.05438·cs.CV·June 11, 2021

TL;DR

This paper introduces a self-supervised framework for learning fine-grained, cross-modal representations using a shared discretized embedding space, enabling improved cross-modal retrieval and localization without direct supervision.

Contribution

It proposes a novel discretized embedding space and a cross-modal code matching objective for fine-grained, multi-modal representation learning.

Findings

1. Discretized representations improve cross-modal retrieval performance.

2. Shared embedding space captures semantic concepts across modalities.

3. Discretized clusters align with semantic concepts across different data types.

Abstract

Recent advances in representation learning have demonstrated an ability to represent information from different modalities, such as video, text, and audio, in a single high-level embedding vector. In this work we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. In our experiments…
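The two ideas in the abstract, a codebook shared across modalities and an objective that pulls the code distributions of paired views together, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the softmax over negative squared distances, and the symmetric cross-entropy loss are not taken from the paper, whose exact formulation is not given in this excerpt. The random vectors stand in for encoder outputs.

```python
import math
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Vector quantization: index of the nearest codeword for each feature."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def code_distribution(features, codebook, temperature=1.0):
    """Soft assignment over codewords, averaged over all positions.

    Assumed form: softmax over negative squared distances to each codeword.
    """
    avg = [0.0] * len(codebook)
    for f in features:
        logits = [-sq_dist(f, c) / temperature for c in codebook]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        for k, e in enumerate(exps):
            avg[k] += e / (z * len(features))
    return avg

def code_matching_loss(p, q, eps=1e-8):
    """Symmetric cross-entropy between two code distributions (assumed loss)."""
    ce_pq = -sum(a * math.log(b + eps) for a, b in zip(p, q))
    ce_qp = -sum(b * math.log(a + eps) for a, b in zip(p, q))
    return 0.5 * (ce_pq + ce_qp)

# Hypothetical stand-ins for encoder outputs of one paired video/audio clip.
random.seed(0)
dim, num_codes = 4, 8
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes)]
video_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
audio_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(7)]

video_codes = quantize(video_feats, codebook)  # one shared codebook for both
p_video = code_distribution(video_feats, codebook)
p_audio = code_distribution(audio_feats, codebook)
loss = code_matching_loss(p_video, p_audio)
print(video_codes, round(loss, 4))
```

Because both modalities index the same codebook, a codeword that fires for a visual region and for a spoken word can be read as a shared concept, which is the intuition behind localization without direct supervision. In a trained system the loss would be minimized over the encoders and the codebook; this sketch only evaluates it once.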

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning