STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

Chen Chen; Bowen Zhang; Liangliang Cao; Jiguang Shen; Tom Gunter; Albin Madappally Jose; Alexander Toshev; Jonathon Shlens; Ruoming Pang; Yinfei Yang

arXiv:2301.13081 · cs.CV · February 9, 2023

TL;DR

This paper introduces STAIR, a sparse semantic representation for image and text retrieval that outperforms dense models like CLIP in accuracy while maintaining interpretability and ease of integration.

Contribution

The authors extend CLIP to create a sparse token space representation, achieving superior retrieval performance and interpretability compared to dense embeddings.

Findings

01

STAIR outperforms CLIP by 4.9 and 4.3 absolute percentage points in Recall@1 on COCO-5k text-to-image and image-to-text retrieval, respectively.

02

STAIR also achieves better results than CLIP on ImageNet zero-shot classification.

03

Sparse representations can match or surpass dense embeddings in retrieval accuracy.
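
For readers unfamiliar with the metric quoted in the findings, Recall@K is the fraction of queries whose correct match appears among the K highest-scoring candidates. The following is a minimal sketch, assuming exactly one correct candidate per query; the function name `recall_at_k` and the toy score matrix are illustrative and not taken from the paper, and the COCO-5k protocol (several captions per image) needs extra bookkeeping not shown here.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]],
                targets: Sequence[int],
                k: int = 1) -> float:
    """Fraction of queries whose target index is among the top-k candidates.

    scores[i][j] is the similarity between query i and candidate j.
    targets[i] is the index of the single correct candidate for query i
    (a simplifying assumption made for this sketch).
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy example: 3 queries, 3 candidates each (made-up numbers).
toy_scores = [
    [0.9, 0.1, 0.0],  # correct candidate 0 is ranked first
    [0.2, 0.3, 0.5],  # correct candidate 1 is ranked second
    [0.1, 0.8, 0.1],  # correct candidate 1 is ranked first
]
toy_targets = [0, 1, 1]

r1 = recall_at_k(toy_scores, toy_targets, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(toy_scores, toy_targets, k=2)  # all 3 queries hit
```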

Abstract

Image and text retrieval is one of the foundational tasks in the vision and language domain, with multiple real-world applications. State-of-the-art approaches such as CLIP and ALIGN represent images and texts as dense embeddings and compute similarity in the dense embedding space as the matching score. Sparse semantic features such as bag-of-words models, on the other hand, are more interpretable but are believed to be less accurate than dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing…
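
To make the idea of a sparse token space concrete, here is a minimal sketch of how two sparse representations might be compared. It assumes each image or text is stored as a mapping from vocabulary tokens to non-negative weights and that the matching score is a dot product over shared tokens. The tokens, weights, and helper names (`sparse_dot`, `top_tokens`) are hypothetical and chosen for illustration; this is not the STAIR model itself, which learns these weights with an extended CLIP architecture.

```python
from typing import Dict, List

# A sparse representation: only tokens with non-zero weight are stored.
SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product over the tokens that both representations share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller mapping
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def top_tokens(vec: SparseVec, k: int = 3) -> List[str]:
    """The k highest-weighted tokens, which serve as a readable summary."""
    return sorted(vec, key=vec.get, reverse=True)[:k]


# Hypothetical weights for one image and two candidate captions.
image_vec: SparseVec = {"dog": 2.0, "frisbee": 1.5, "grass": 0.5}
caption_a: SparseVec = {"dog": 1.0, "frisbee": 1.0, "catch": 0.5}
caption_b: SparseVec = {"cat": 1.5, "sofa": 1.0}

score_a = sparse_dot(image_vec, caption_a)  # shares "dog" and "frisbee"
score_b = sparse_dot(image_vec, caption_b)  # shares no tokens
```

Because every dimension is a vocabulary token, inspecting `top_tokens(image_vec)` shows directly which words drive a match, and token-keyed weights of this kind are the sort of structure an inverted index can store.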

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: ALIGN · Contrastive Language-Image Pre-training