Recognize Anything: A Strong Image Tagging Model

Youcai Zhang; Xinyu Huang; Jinyu Ma; Zhaoyang Li; Zhaochuan Luo; Yanchun Xie; Yuzhuo Qin; Tong Luo; Yaqian Li; Shilong Liu; Yandong Guo; Lei Zhang

arXiv:2306.03514·cs.CV·June 12, 2023·21 cites

TL;DR

The Recognize Anything Model (RAM) is a new foundation model for image tagging that achieves high zero-shot recognition accuracy across common categories by leveraging large-scale image-text data and a multi-step training process.

Contribution

RAM introduces a novel training paradigm for image tagging using automatic annotation and data cleaning, significantly improving zero-shot performance over existing models.

Findings

01 RAM outperforms CLIP and BLIP on multiple benchmarks.

02 RAM surpasses fully supervised methods in accuracy.

03 RAM achieves performance comparable to the Google tagging API.

Abstract

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but…
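
The first step in the abstract, obtaining tags from captions without manual annotation, can be illustrated with a toy sketch. This is not the authors' parser: the abstract describes automatic text semantic parsing, whereas the code below substitutes plain token matching against a small vocabulary. The names `TAG_SYNONYMS`, `parse_tags`, and `build_training_pairs` and the vocabulary contents are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Set

# Hypothetical tag vocabulary (an assumption, not taken from the paper):
# each canonical tag maps to the surface forms that should trigger it.
TAG_SYNONYMS: Dict[str, Set[str]] = {
    "dog": {"dog", "dogs", "puppy", "puppies"},
    "ball": {"ball", "balls"},
    "grass": {"grass", "lawn"},
    "person": {"person", "people", "man", "woman"},
}

# Inverted index: surface form -> canonical tag.
SURFACE_TO_TAG: Dict[str, str] = {
    form: tag for tag, forms in TAG_SYNONYMS.items() for form in forms
}


def parse_tags(caption: str) -> List[str]:
    """Return canonical tags found in a caption, in order of first appearance.

    Naive stand-in for semantic parsing: lowercase the caption, split it into
    alphabetic tokens, and look each token up in the vocabulary.
    """
    tokens = re.findall(r"[a-z]+", caption.lower())
    tags: List[str] = []
    for token in tokens:
        tag = SURFACE_TO_TAG.get(token)
        if tag is not None and tag not in tags:
            tags.append(tag)
    return tags


def build_training_pairs(captions: List[str]) -> List[dict]:
    """Pair each caption with its parsed tags.

    This mirrors the dual supervision the abstract describes: the original
    text for the caption task and the parsed tags for the tagging task.
    """
    return [{"caption": c, "tags": parse_tags(c)} for c in captions]


if __name__ == "__main__":
    for pair in build_training_pairs(["A puppy chases a ball on the lawn."]):
        print(pair)
```

Mapping several surface forms to one canonical tag keeps the label space small. A real pipeline would also need to handle multi-word phrases and ambiguous words, which simple token matching cannot.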

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training