TL;DR
This paper introduces Text2Vis, a neural network that translates textual descriptions into visual features for image search, enabling efficient and flexible retrieval in visual space without reprocessing large image collections.
Contribution
The paper presents a novel neural network model, Text2Vis, that maps text to visual features for image search, incorporating dual loss functions for improved semantic and visual accuracy.
Findings
Preliminary results on the MS-COCO dataset demonstrate promising performance.
The approach allows updates to the translation model without reprocessing the image collection.
Using dual loss functions improves the semantic relevance of retrieved images.
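The second finding can be illustrated with a minimal sketch: image features are computed once and stored, and only the query passes through the text-to-visual translator, so swapping translators leaves the index untouched. Everything named here is hypothetical and not from the paper: `IMAGE_INDEX`, the 3-dimensional toy vectors, and the two stand-in translators `translate_v1` and `translate_v2`. Cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, index, k=2):
    """Return the ids of the k indexed images most similar to query_vec."""
    scored = [(cosine(query_vec, feat), image_id) for image_id, feat in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]

# Visual features extracted ONCE per image (toy 3-d vectors, assumed for illustration).
IMAGE_INDEX = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}

def translate_v1(text):
    # Hypothetical stand-in for a trained text-to-visual model.
    return [1.0, 0.1, 0.0] if "dog" in text else [0.1, 1.0, 0.0]

def translate_v2(text):
    # Hypothetical "retrained" model: different outputs, same feature space.
    return [0.9, 0.3, 0.0] if "dog" in text else [0.2, 0.9, 0.0]

# Swapping the translator changes only the query vector; IMAGE_INDEX is reused as-is.
results_v1 = search(translate_v1("a dog on grass"), IMAGE_INDEX)
results_v2 = search(translate_v2("a dog on grass"), IMAGE_INDEX)
```

A real system would replace the linear scan with an approximate nearest-neighbour index, but the point stands: the stored image features do not depend on the translator.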
Abstract
In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the visual feature space of the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual feature mapping, while a text-focused loss is aimed at modeling the…
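One plausible reading of "stochastic loss selection" is that each training step draws at random which of the two losses to optimize. The sketch below shows only that selection mechanism. The selection probability `p_visual`, the function names, and the constant placeholder loss values are assumptions for illustration; the excerpt above does not specify how the selection is made or how either loss is defined.

```python
import random

def select_loss(rng, p_visual=0.5):
    """Pick which loss to optimize on this step.

    p_visual is an assumed hyperparameter; the abstract excerpt does not state one.
    """
    return "visual" if rng.random() < p_visual else "text"

def training_schedule(num_steps, seed=0, p_visual=0.5):
    """Return the sequence of loss choices over num_steps training steps."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    return [select_loss(rng, p_visual) for _ in range(num_steps)]

def run_step(choice, loss_fns):
    """Evaluate only the selected loss; the other is skipped on this step."""
    return loss_fns[choice]()

# Placeholder losses returning constants, standing in for the real objectives.
placeholder_losses = {
    "visual": lambda: 0.25,  # stands in for the text-to-visual mapping loss
    "text": lambda: 0.75,    # stands in for the text-focused loss
}

schedule = training_schedule(num_steps=10, seed=0)
step_losses = [run_step(choice, placeholder_losses) for choice in schedule]
```

In an actual training loop, the selected loss would be backpropagated through whichever parameters it depends on, while the other loss is left untouched for that step.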
