Cross-Modal Discrete Representation Learning
Alexander H. Liu, SouYoung Jin, Cheng-I Jeff Lai, Andrew Rouditchenko, Aude Oliva, James Glass

TL;DR
This paper introduces a self-supervised framework for learning fine-grained, cross-modal representations using a shared discretized embedding space, enabling improved cross-modal retrieval and localization without supervision.
Contribution
It proposes a novel discretized embedding space and a cross-modal code matching objective for fine-grained, multi-modal representation learning.
Findings
Discretized representations improve cross-modal retrieval performance.
Shared embedding space captures semantic concepts across modalities.
Discretized clusters align with semantic concepts across different data types.
Abstract
Recent advances in representation learning have demonstrated an ability to represent information from different modalities such as video, text, and audio in a single high-level embedding vector. In this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space such that cross-modal objects/actions localization can be performed without direct supervision. In our experiments…
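The two ideas in the abstract, a codebook shared across modalities and an objective that pulls the code distributions of paired inputs together, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`quantize`, `code_distribution`, `symmetric_cross_entropy`), the softmax over negative squared distances, the temperature, and the use of a plain symmetric cross-entropy as the matching score are all assumptions made for illustration.

```python
import numpy as np


def pairwise_sq_dist(features, codebook):
    """Squared Euclidean distance between each feature (T, D) and codeword (K, D)."""
    diff = features[:, None, :] - codebook[None, :, :]  # (T, K, D) via broadcasting
    return (diff ** 2).sum(axis=-1)                     # (T, K)


def quantize(features, codebook):
    """Hard vector quantization: map each feature to its nearest codeword."""
    idx = pairwise_sq_dist(features, codebook).argmin(axis=1)
    return idx, codebook[idx]


def code_distribution(features, codebook, temperature=1.0):
    """Soft code assignment per frame, averaged over the sequence (assumed form)."""
    logits = -pairwise_sq_dist(features, codebook) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)                            # (K,), sums to 1


def symmetric_cross_entropy(p, q, eps=1e-8):
    """Illustrative matching score: low when p and q use the same codes."""
    return float(-(p * np.log(q + eps)).sum() - (q * np.log(p + eps)).sum())


# Toy example: one codebook shared by a "video" and an "audio" encoder.
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
video = np.array([[0.1, 0.0], [9.9, 0.1]])          # uses codes 0 and 1
audio_paired = np.array([[0.0, 0.2], [10.1, 0.0]])  # also uses codes 0 and 1
audio_other = np.array([[0.0, 9.8], [10.0, 10.1]])  # uses codes 2 and 3

p_video = code_distribution(video, codebook)
loss_paired = symmetric_cross_entropy(p_video, code_distribution(audio_paired, codebook))
loss_other = symmetric_cross_entropy(p_video, code_distribution(audio_other, codebook))
```

In this toy setup the paired audio clip activates the same codewords as the video clip, so `loss_paired` is smaller than `loss_other`. A training objective built on such a score would encourage paired inputs from different modalities to share codes, which is the property the abstract relies on for localization without direct supervision.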
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
