Recognize Anything: A Strong Image Tagging Model
Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang

TL;DR
The Recognize Anything Model (RAM) is a new foundation model for image tagging that achieves high zero-shot recognition accuracy across common categories by leveraging large-scale image-text data and a multi-step training process.
Contribution
RAM introduces a novel training paradigm for image tagging using automatic annotation and data cleaning, significantly improving zero-shot performance over existing models.
Findings
RAM outperforms CLIP and BLIP on multiple benchmarks.
RAM surpasses fully supervised methods in accuracy.
RAM achieves performance comparable to the Google tagging API.
Abstract
We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but…
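The first step above, obtaining tags from captions without manual annotation, can be illustrated with a minimal sketch. It assumes a small hypothetical tag vocabulary (`TAG_VOCAB`) and uses plain whole-word matching as a simplified stand-in for the text semantic parsing the abstract mentions; the helper names `parse_tags` and `build_training_pairs` are illustrative and are not taken from the RAM codebase.

```python
import re

# Hypothetical tag vocabulary (an assumption for this sketch); multi-word tags are allowed.
TAG_VOCAB = {"dog", "frisbee", "grass", "park", "tennis ball"}


def parse_tags(caption, vocab=TAG_VOCAB):
    """Return the vocabulary tags mentioned in a caption, sorted alphabetically."""
    # Lowercase and keep only alphabetic words, so punctuation does not block matches.
    text = " ".join(re.findall(r"[a-z]+", caption.lower()))
    padded = f" {text} "
    found = set()
    for tag in vocab:
        # Match whole words only, tolerating a simple plural "s".
        if f" {tag} " in padded or f" {tag}s " in padded:
            found.add(tag)
    return sorted(found)


def build_training_pairs(pairs):
    """Attach parsed tags to (image_id, caption) pairs.

    The caption can supervise a captioning objective and the tags a tagging
    objective, mirroring the two supervision signals described in the abstract.
    """
    return [
        {"image": image_id, "caption": caption, "tags": parse_tags(caption)}
        for image_id, caption in pairs
    ]
```

For example, `parse_tags("A dog catches a frisbee on the grass.")` yields the tags `dog`, `frisbee`, and `grass`. A real pipeline would likely need a proper parser, synonym handling, and a far larger vocabulary; this sketch only shows the shape of the data that annotation-free tagging produces.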
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training
