OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
Sheng Liu, Kevin Lin, Lijuan Wang, Junsong Yuan, Zicheng Liu

TL;DR
This paper introduces open-vocabulary visual instance search (OVIS), enabling retrieval of specific image patches based on arbitrary textual queries by learning a cross-modal semantic space using weak image-caption supervision.
Contribution
The paper proposes ViSA, a novel visual-semantic aligned representation learning method for open-vocabulary instance search, and provides new datasets and evaluation pipelines.
Findings
ViSA effectively aligns visual instances with textual queries.
ViSA achieves 21.9% mAP@50 on the OVIS40 dataset.
ViSA is robust to uncommon words in queries.
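To make the mAP@50 figure concrete, here is a minimal sketch of one common way to compute mean average precision over the top-k ranked results. It assumes binary relevance labels and normalizes by the number of relevant items found within the top k. The exact definition used in the paper is not given in this summary, so the function names and the normalization are illustrative assumptions.

```python
def average_precision_at_k(relevance, k=50):
    """AP@k for one query.

    `relevance` is a list of 0/1 flags in ranked order (hypothetical input
    format). Normalizing by the hits found in the top k is an assumption;
    other conventions divide by min(k, total relevant items).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(relevance_per_query, k=50):
    """mAP@k: the mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, k) for r in relevance_per_query]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example with two queries and their ranked relevance flags.
toy_runs = [[1, 0, 1, 0], [0, 1]]
toy_map = mean_average_precision_at_k(toy_runs, k=50)
```

With this convention, a query whose relevant instances all appear at the top of the ranking scores 1.0, and a query with no relevant instance in the top k scores 0.0.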
Abstract
We introduce the task of open-vocabulary visual instance search (OVIS). Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, from an image database that satisfy the search intent. The term "open vocabulary" means that there are no restrictions on the visual instances to be searched, nor on the words that can be used to compose the textual search query. We propose to address this search challenge via visual-semantic aligned representation learning (ViSA). ViSA leverages massive image-caption pairs as weak image-level (not instance-level) supervision to learn a rich cross-modal semantic space where the representations of visual instances (not images) and those of textual queries are aligned, thus allowing us to measure the similarities between any visual instance and…
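The retrieval step described in the abstract can be sketched as a nearest-neighbor ranking in a shared embedding space. The sketch below assumes that query and instance embeddings have already been produced by some encoder and that cosine similarity is the scoring function. Both are illustrative assumptions; this summary does not specify ViSA's actual encoders or similarity measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_instances(query_vec, instance_vecs, top_k=5):
    """Return (instance_index, score) pairs, highest similarity first.

    `query_vec` stands in for a text-query embedding and `instance_vecs`
    for image-patch embeddings (hypothetical precomputed inputs).
    """
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(instance_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D embeddings: three "patches" and one "query".
toy_instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]
ranked = rank_instances(toy_query, toy_instances, top_k=3)
```

In this toy case the patch pointing in the same direction as the query ranks first and the orthogonal patch ranks last. A real system would typically swap the linear scan for an approximate nearest-neighbor index once the database grows large.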
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
