T-EMDE: Sketching-based global similarity for cross-modal retrieval
Barbara Rychalska, Mikolaj Wieczorek, Jacek Dabrowski

TL;DR
T-EMDE introduces a trainable, sketching-based module for cross-modal retrieval that reduces complexity and improves performance by bridging modality gaps with standardized histogram representations.
Contribution
The paper presents T-EMDE, a differentiable, linear-complexity alternative to self-attention for cross-modal retrieval that can be trained end to end.
Findings
Achieves state-of-the-art results on multiple datasets.
Reduces model latency by up to 20%.
Facilitates modality communication with standardized histograms.
Abstract
The key challenge in cross-modal retrieval is to find similarities between objects represented with different modalities, such as image and text. However, each modality's embeddings stem from unrelated feature spaces, which causes the notorious 'heterogeneity gap'. Currently, many cross-modal systems try to bridge the gap with self-attention. However, self-attention has been widely criticized for its quadratic complexity, which prevents many real-life applications. In response, we propose T-EMDE, a neural density estimator inspired by the recently introduced Efficient Manifold Density Estimator (EMDE) from the area of recommender systems. EMDE operates on sketches, representations especially suitable for multimodal operations. However, EMDE is non-differentiable and ingests precomputed, static embeddings. With T-EMDE we introduce a trainable version of EMDE which allows full…
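To make the idea of a sketch concrete, here is a minimal illustration of a histogram-style summary of a set of embeddings. It is not the authors' implementation: the random-hyperplane partitioning, the function names `make_partitions` and `sketch`, and all sizes are assumptions chosen for illustration, and the trainable (differentiable) part of T-EMDE is not reproduced. The sketch also assumes both modalities have already been projected to the same dimension. What it does show is why such a summary is cheap: each vector is assigned to one bucket per partitioning, so the cost grows linearly with the number of vectors, and the output has a fixed size regardless of how many regions or tokens go in.

```python
import numpy as np

def make_partitions(dim, n_partitions, n_planes, seed=0):
    """Draw random hyperplanes; each group of n_planes splits space into 2**n_planes buckets.

    Hypothetical helper: a simplified stand-in for the space partitioning used by sketching methods.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_partitions, n_planes, dim))

def sketch(vectors, planes):
    """Summarise a set of vectors as concatenated, normalised bucket histograms."""
    n_partitions, n_planes, _ = planes.shape
    n_buckets = 2 ** n_planes
    # Sign of each vector against each hyperplane: shape (n_partitions, n_vectors, n_planes).
    bits = (np.einsum("kpd,nd->knp", planes, vectors) > 0).astype(np.int64)
    # Read the sign pattern as a binary number to get one bucket index per partitioning.
    codes = bits @ (2 ** np.arange(n_planes))
    hist = np.zeros((n_partitions, n_buckets))
    for k in range(n_partitions):
        hist[k] = np.bincount(codes[k], minlength=n_buckets)
    hist /= len(vectors)  # each partitioning's histogram now sums to 1
    return hist.reshape(-1)

# Toy inputs standing in for image-region and text-token embeddings (random data, not real features).
planes = make_partitions(dim=8, n_partitions=4, n_planes=3)
rng = np.random.default_rng(1)
image_regions = rng.standard_normal((36, 8))
text_tokens = rng.standard_normal((12, 8))

img_sketch = sketch(image_regions, planes)
txt_sketch = sketch(text_tokens, planes)
similarity = float(img_sketch @ txt_sketch)
```

Both sketches have the same length (4 partitionings × 8 buckets = 32) even though the inputs contain 36 and 12 vectors, so they can be compared directly with a dot product. Self-attention over the same inputs would instead compare every vector with every other vector.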
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
