Democratizing Fine-grained Visual Recognition with Large Language Models
Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, Elisa Ricci

TL;DR
This paper introduces FineR, a training-free method that uses large language models to perform fine-grained visual recognition by reasoning about category names from visual attributes, reducing reliance on expert annotations.
Contribution
The paper proposes FineR, a novel approach leveraging LLMs for fine-grained recognition without training, bridging image and language modalities through visual attributes.
Findings
FineR outperforms state-of-the-art FGVR models.
Effective in the wild and in new domains where annotations are limited.
Reduces the need for expert-labeled data.
Abstract
Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications since an average layperson does not excel at differentiating species of birds or mushrooms due to subtle differences among the species. A major bottleneck in developing FGVR systems is caused by the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR) that internally leverages the world knowledge of large language models (LLMs) as a proxy in order to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLMs, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes…
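The attribute-to-LLM step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the part-level attributes have already been extracted from the image as text, and the function names (`build_reasoning_prompt`, `parse_candidates`, `reason_category_names`), the prompt wording, and the `llm` callable are all hypothetical placeholders.

```python
from typing import Callable, Dict, List


def build_reasoning_prompt(super_category: str,
                           attributes: Dict[str, str],
                           num_candidates: int = 3) -> str:
    """Format part-level attribute descriptions as a text prompt.

    The prompt wording is an assumption made for illustration only.
    """
    lines = [f"- {part}: {desc}" for part, desc in attributes.items()]
    return (
        f"An image shows a {super_category} with these visual attributes:\n"
        + "\n".join(lines)
        + f"\nList up to {num_candidates} plausible fine-grained "
        + f"{super_category} category names, one per line."
    )


def parse_candidates(llm_output: str) -> List[str]:
    """Turn a line-per-name LLM reply into a de-duplicated list of names."""
    names: List[str] = []
    for line in llm_output.splitlines():
        # Drop list markers such as "1. " or "- " from the start of the line.
        name = line.strip().lstrip("-*0123456789. ").strip()
        if name and name not in names:
            names.append(name)
    return names


def reason_category_names(super_category: str,
                          attributes: Dict[str, str],
                          llm: Callable[[str], str],
                          num_candidates: int = 3) -> List[str]:
    """Ask an LLM (any text-in, text-out callable) for candidate names."""
    prompt = build_reasoning_prompt(super_category, attributes, num_candidates)
    return parse_candidates(llm(prompt))[:num_candidates]


if __name__ == "__main__":
    # Stub standing in for a real LLM call; its reply is canned.
    def fake_llm(prompt: str) -> str:
        return "1. Blue Jay\n2. Steller's Jay"

    attrs = {"wing": "blue with white bars", "crest": "blue, pointed"}
    print(reason_category_names("bird", attrs, fake_llm))
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model or client library; a real system would replace `fake_llm` with an actual model call and obtain `attrs` from the image.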
Peer Reviews
Decision: ICLR 2024 poster
This is a good paper, and the advantages are: 1. The application of LLMs to vocabulary-free FGVR tasks is novel; 2. The paper is generally well written and easy to follow; 3. Good performance in Table 1; 4. Effective use of the LLM.
Advice: 1. Add citations for some highly related missing works: (1) TransHP: Image Classification with Hierarchical Prompting; it also focuses on the fine-grained image classification task. Also, it takes advantage of the recently proposed prompting technique in CV. Is your LLM-based approach better, or is prompting better? (2) V2L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval; it also focuses on the vision language model and fine-grained visual recognition. The paper is t
1. The presentation of this paper is clear and the proposed method is easy to follow. 2. This paper proposes to extract visual attributes from images into the large language models (LLM) for reasoning the fine-grained subcategory names, which is a promising way to alleviate the high need for expert annotations in fine-grained recognition.
The weaknesses are as follows: 1. The novelty of this paper should be further demonstrated. The proposed method seems to be an intuitive combination of existing large-scale models, such as a visual question answering model, a large language model, and a vision-language model. Besides, extracting visual attributes from images for recognition is widely used in generalized zero-shot learning methods such as [a]. 2. The effectiveness of the proposed method should be further verified. The recognition
- This paper is well-written, easy-to-follow and well-presented. - A novel Pokemon dataset is proposed to benefit FGVC. - The proposed method is novel and interesting, which provides a new way to do training-free FGVC based on LLM. - The proposed method is experimentally better than existing state-of-the-art methods.
- Although the novelty and technique score is satisfactory for ICLR, a major issue is whether it is necessary to leverage only an LLM for FGVC under the proposed setting. In fact, as far as the reviewer is aware, some other methods that only learn prompts on a VLM can already achieve more than 90% accuracy for zero-shot FGVC. For example: [1] Conditional Prompt Learning for Vision-Language Models. CVPR 2022. [2] Learning to Prompt for Vision-Language Models. IJCV 2022. In fact, this issue is critical, at le
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Digital Imaging for Blood Diseases
