Using Text to Teach Image Retrieval
Haoyu Dong, Ze Wang, Qiang Qiu, and Guillermo Sapiro

TL;DR
This paper improves image retrieval by augmenting the image feature manifold with geometrically aligned text, which helps most when few images are available, and introduces a new CLEVR-based dataset for evaluating semantic similarity.
Contribution
The paper proposes representing the learned image feature space as a graph, with neighborhoods defined by geodesic distance, and augmenting that graph with geometrically aligned text to improve retrieval performance.
Findings
Text augmentation improves image retrieval accuracy, particularly when few images are available.
Joint embedding manifolds are more robust for retrieval tasks.
A new CLEVR-based dataset quantifies semantic similarity between images and text.
Abstract
Image retrieval relies heavily on the quality of the data modeling and the distance measurement in the feature space. Building on the concept of image manifold, we first propose to represent the feature space of images, learned via neural networks, as a graph. Neighborhoods in the feature space are now defined by the geodesic distance between images, represented as graph vertices or manifold samples. When limited images are available, this manifold is sparsely sampled, making the geodesic computation and the corresponding retrieval harder. To address this, we augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images. In addition to extensive results on standard datasets illustrating the power of text to help in image retrieval, a new public dataset based on CLEVR is introduced to quantify the semantic similarity between…
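To make the graph view in the abstract concrete, here is a minimal sketch of geodesic retrieval on a toy manifold. It is not the authors' implementation: the radius-based graph construction, the hand-placed 2-D coordinates, and the helper names (`radius_graph`, `geodesic_distances`, `retrieve`) are all assumptions made for illustration, and the text points are assumed to be already aligned to the image feature space.

```python
import heapq
import math


def radius_graph(points, radius):
    """Connect every pair of points whose Euclidean distance is <= radius."""
    graph = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= radius:
                graph[i][j] = d
                graph[j][i] = d  # undirected edge
    return graph


def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source; unreachable nodes stay inf."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def retrieve(points, is_image, query, radius, top=5):
    """Rank reachable image nodes by geodesic distance from the query node."""
    graph = radius_graph(points, radius)
    dist = geodesic_distances(graph, query)
    ranked = sorted(
        (d, i) for i, d in dist.items()
        if i != query and is_image[i] and math.isfinite(d)
    )
    return [i for _, i in ranked[:top]]


# Hypothetical toy data: a U-shaped manifold sampled by three images only.
IMAGES = [(0.0, 0.0), (3.0, 0.0), (0.0, 2.0)]
# Hypothetical aligned text features that fill the gaps along the U.
TEXT = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (4.0, 1.0),
        (4.0, 2.0), (3.0, 2.0), (2.0, 2.0), (1.0, 2.0)]

images_only = retrieve(IMAGES, [True] * 3, query=0, radius=1.2)
augmented = retrieve(IMAGES + TEXT, [True] * 3 + [False] * 8, query=0, radius=1.2)
print(images_only)  # []
print(augmented)    # [1, 2]
```

With images alone the graph has no edges, so no geodesic exists and nothing is retrieved. Once the text points are added, image 1 (three steps along the same arm of the U) ranks ahead of image 2, even though image 2 is closer to the query in plain Euclidean distance (2 versus 3).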
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
