STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang, Yinfei Yang

TL;DR
This paper introduces STAIR, a sparse semantic representation for image and text retrieval that outperforms dense models like CLIP in accuracy while maintaining interpretability and ease of integration.
Contribution
The authors extend CLIP to create a sparse token space representation, achieving superior retrieval performance and interpretability compared to dense embeddings.
Findings
STAIR outperforms CLIP by 4.9 and 4.3 absolute percentage points in Recall@1 on COCO-5k text-to-image and image-to-text retrieval, respectively.
STAIR also achieves better results than CLIP on ImageNet zero-shot classification.
Sparse representations can match or surpass dense embeddings in retrieval accuracy.
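For readers unfamiliar with the metric, Recall@1 is the fraction of queries whose top-ranked candidate is a correct match. The following is a minimal sketch with made-up scores, not the paper's evaluation code; the function name `recall_at_k` and the toy data are assumptions chosen for illustration.

```python
def recall_at_k(scores, relevant, k=1):
    """Fraction of queries with at least one correct candidate in the top k.

    scores[q][c] is the similarity of query q to candidate c.
    relevant[q] is the set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, row in enumerate(scores):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if any(c in relevant[q] for c in top_k):
            hits += 1
    return hits / len(scores)


# Toy similarity matrix (hypothetical values): 3 queries, 3 candidates.
scores = [
    [0.9, 0.1, 0.3],  # query 0 ranks candidate 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks candidate 2 first (incorrect)
    [0.1, 0.2, 0.7],  # query 2 ranks candidate 2 first (correct)
]
relevant = [{0}, {1}, {2}]

r1 = recall_at_k(scores, relevant, k=1)
r2 = recall_at_k(scores, relevant, k=2)
```

Here two of the three queries retrieve the right candidate first, so Recall@1 is 2/3, while Recall@2 reaches 1.0 because the missed match is ranked second.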
Abstract
Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art approaches such as CLIP and ALIGN represent images and texts as dense embeddings and compute similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing…
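The idea of matching in a sparse token space can be illustrated with a small sketch. It assumes each image or text has already been encoded as a mapping from vocabulary tokens to non-negative weights; the encoders, the toy tokens, the weights, and the helper names `sparse_dot` and `top_tokens` are all hypothetical and are not taken from the paper.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser vector.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[t] for t, w in a.items() if t in b)


def top_tokens(vec, n=3):
    """Return the n highest-weighted tokens, which is what makes the vector readable."""
    return sorted(vec, key=vec.get, reverse=True)[:n]


# Hypothetical sparse representations over a shared (sub-)word vocabulary.
image_vec = {"dog": 2.0, "grass": 1.0, "ball": 0.5}
text_match = {"dog": 1.5, "ball": 1.0}
text_other = {"cat": 2.0, "sofa": 1.0}

score_match = sparse_dot(image_vec, text_match)  # shares "dog" and "ball"
score_other = sparse_dot(image_vec, text_other)  # shares no tokens
readable = top_tokens(image_vec, n=2)
```

Because only tokens present in both vectors contribute to the score, a representation of this shape can be inspected token by token and could, in principle, be stored in a conventional inverted index.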
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: ALIGN · Contrastive Language-Image Pre-training
