T-EMDE: Sketching-based global similarity for cross-modal retrieval

Barbara Rychalska; Mikolaj Wieczorek; Jacek Dabrowski

arXiv:2105.04242 · stat.ML · May 11, 2021

Open Access

TL;DR

T-EMDE introduces a trainable, sketching-based module for cross-modal retrieval that reduces complexity and improves performance by bridging modality gaps with standardized histogram representations.

Contribution

It presents T-EMDE, a differentiable, linear-complexity alternative to self-attention for cross-modal retrieval, enabling end-to-end training and enhanced efficiency.

Findings

01 Achieves state-of-the-art results on multiple datasets.
02 Reduces model latency by up to 20%.
03 Facilitates modality communication with standardized histograms.

Abstract

The key challenge in cross-modal retrieval is to find similarities between objects represented in different modalities, such as image and text. However, the embeddings of each modality stem from unrelated feature spaces, which causes the notorious 'heterogeneity gap'. Currently, many cross-modal systems try to bridge the gap with self-attention. However, self-attention has been widely criticized for its quadratic complexity, which prevents many real-life applications. In response, we propose T-EMDE, a neural density estimator inspired by the recently introduced Efficient Manifold Density Estimator (EMDE) from the area of recommender systems. EMDE operates on sketches, representations that are especially suitable for multimodal operations. However, EMDE is non-differentiable and ingests precomputed, static embeddings. With T-EMDE we introduce a trainable version of EMDE which allows full…
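To make the idea of a sketch concrete, the following is a minimal, non-trainable illustration and not the authors' T-EMDE module. It assumes random hyperplanes as the space partitioning (a stand-in chosen for brevity), and the names `make_partitions` and `sketch`, as well as all sizes, are hypothetical. Each set of embeddings is turned into a fixed-size stack of histograms in a single pass over its items, so the cost grows linearly with the number of items rather than quadratically as in self-attention.

```python
import numpy as np


def make_partitions(dim, n_partitions, n_bits, seed=0):
    """Draw random hyperplanes; each group of n_bits planes defines one
    partitioning of the embedding space into 2**n_bits buckets (assumption)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_partitions, n_bits, dim))


def sketch(embeddings, hyperplanes):
    """Map a variable-size set of embeddings to a fixed-size histogram stack.

    embeddings:  (n_items, dim)
    hyperplanes: (n_partitions, n_bits, dim)
    returns:     (n_partitions, 2**n_bits), each row summing to 1
    """
    n_partitions, n_bits, _ = hyperplanes.shape
    n_buckets = 2 ** n_bits
    # The side of each hyperplane an item falls on gives one bit of its bucket id.
    bits = (np.einsum("pbd,nd->pnb", hyperplanes, embeddings) > 0).astype(np.int64)
    bucket_ids = bits @ (2 ** np.arange(n_bits))  # (n_partitions, n_items)
    hist = np.zeros((n_partitions, n_buckets))
    for p in range(n_partitions):
        hist[p] = np.bincount(bucket_ids[p], minlength=n_buckets)
    return hist / embeddings.shape[0]  # normalise so set size does not matter


# Hypothetical inputs: 36 image-region vectors and 12 word vectors,
# assumed to be already projected to a shared 64-dimensional space.
rng = np.random.default_rng(1)
regions = rng.standard_normal((36, 64))
words = rng.standard_normal((12, 64))

planes = make_partitions(dim=64, n_partitions=4, n_bits=3)
image_sketch = sketch(regions, planes)
text_sketch = sketch(words, planes)

# Both modalities now share one fixed-size histogram layout,
# so a simple dot product can compare them.
similarity = float(np.sum(image_sketch * text_sketch))
```

Because both sketches have the same shape regardless of how many regions or words went in, they can be compared directly. In the paper's setting this representation is made trainable end to end, which this sketch does not attempt.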

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning