Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
Dhruba Ghosh, Yuhui Zhang, Ludwig Schmidt

TL;DR
This paper investigates the fine-grained visual knowledge capabilities of vision-language models, revealing key factors that influence their performance on detailed classification tasks and suggesting avenues for improvement.
Contribution
It identifies the impact of vision encoder quality and pretraining strategies on fine-grained knowledge in VLMs, providing new insights for enhancing their detailed visual understanding.
Findings
Better vision encoders improve fine-grained classification more than they improve other benchmarks.
The pretraining stage is crucial for fine-grained performance, especially when the language model is unfrozen.
Using a superior LLM enhances all benchmark scores equally.
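The contrast between an "equal" and a "disproportionate" improvement can be made concrete by comparing the mean score gain on fine-grained benchmarks with the mean gain on all other benchmarks. The sketch below is one possible way to compute that gap, not the paper's own analysis; the function names, benchmark names, and every score in the usage note are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, Set


def mean_gain(baseline: Dict[str, float], improved: Dict[str, float]) -> float:
    """Average per-benchmark score change over benchmarks present in both runs."""
    shared = sorted(set(baseline) & set(improved))
    if not shared:
        raise ValueError("no benchmarks in common between the two runs")
    return sum(improved[name] - baseline[name] for name in shared) / len(shared)


def gain_gap(
    baseline: Dict[str, float],
    improved: Dict[str, float],
    fine_grained: Set[str],
) -> float:
    """Mean gain on fine-grained benchmarks minus mean gain on the rest.

    A value near zero means the change helped both groups about equally;
    a clearly positive value means it helped fine-grained benchmarks more.
    """
    fg_base = {n: s for n, s in baseline.items() if n in fine_grained}
    other_base = {n: s for n, s in baseline.items() if n not in fine_grained}
    return mean_gain(fg_base, improved) - mean_gain(other_base, improved)
```

With placeholder scores such as `{"fg_a": 50.0, "vqa_a": 60.0}` before a component swap and `{"fg_a": 60.0, "vqa_a": 62.0}` after it, `gain_gap(..., fine_grained={"fg_a"})` is the 10-point fine-grained gain minus the 2-point gain elsewhere.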
Abstract
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we…
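Testing a VLM on a classification benchmark requires turning each labeled image into a question the model can answer in text. The sketch below shows one common way to do that, a multiple-choice prompt with randomly sampled distractor classes. It is an assumption for illustration only: the prompt wording, the number of choices, and the names `build_item` and `accuracy` are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class MultipleChoiceItem:
    prompt: str          # text shown to the model alongside the image
    answer_letter: str   # letter of the correct option


def build_item(
    true_label: str,
    all_labels: Sequence[str],
    num_choices: int = 4,
    rng: Optional[random.Random] = None,
) -> MultipleChoiceItem:
    """Turn one classification label into a multiple-choice question."""
    rng = rng or random.Random(0)
    candidates = [label for label in all_labels if label != true_label]
    # Sample distractors from the other classes, then shuffle in the true label.
    options = rng.sample(candidates, num_choices - 1) + [true_label]
    rng.shuffle(options)
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:num_choices]
    lines = [f"{letter}. {option}" for letter, option in zip(letters, options)]
    prompt = (
        "What is shown in the image?\n"
        + "\n".join(lines)
        + "\nAnswer with a single letter."
    )
    return MultipleChoiceItem(prompt, letters[options.index(true_label)])


def accuracy(predictions: List[str], items: List[MultipleChoiceItem]) -> float:
    """Fraction of model replies whose first character is the correct letter."""
    if not items or len(predictions) != len(items):
        raise ValueError("predictions and items must be non-empty and aligned")
    correct = sum(
        reply.strip().upper()[:1] == item.answer_letter
        for reply, item in zip(predictions, items)
    )
    return correct / len(items)
```

Under this setup, each model reply would be collected by sending `item.prompt` together with the image to the VLM, and `accuracy` would then be computed per benchmark. Random distractors make the task easier than open-ended naming, so harder variants might draw distractors from visually similar classes.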
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper is well-written and easy to follow.
* The paper is well-motivated, as fine-grained perception capabilities are very important for large multimodal models.
* Both the evaluations and the ablation studies are extensive and solid.
* Despite the comprehensive experimentation, the paper lacks novel insights or contributions: no new benchmarks or methods are proposed. The work simply reuses existing benchmarks, reformulates them, and evaluates existing models.
* LMMs are developing very rapidly, but the VLMs evaluated in the paper are outdated (e.g., LLaVA-1.5, Qwen2VL), and the insights provided may not be applicable to current VLMs.
* The experimental findings do not bring new insights, and are similar to most commo
* This paper investigates an important problem: which components of MLLMs influence performance on fine-grained visual classification. This is a valuable and underexplored topic in the existing literature.
* The work provides several useful insights into the fine-grained visual classification capabilities of modern MLLMs, which could guide future research in designing more effective multimodal models.
* Although the paper includes numerous experiments, figures, and conclusions, they are well-o
* Although the authors investigate several factors influencing MLLM performance on fine-grained visual recognition benchmarks, the impact of data scale remains unexplored. For instance, how do different proportions of the LLaVA or Molmo datasets affect the final performance? Including such experiments would make the analysis more comprehensive.
* The conclusions drawn in this work may be valuable to the research community. However, given that commercial models are typically trained on trillions
1. It is well-motivated to investigate VLMs on traditional image classification benchmarks, which test fine-grained visual knowledge of existing VLMs.
2. It ablates key differences between models that may contribute to fine-grained classification performance, providing some technical strategies for improving the performance.
1. The contribution of Chapter 3 is limited. Findings 1 and 2 have already been discovered in [a]. Moreover, some typical fine-grained classification datasets, such as Caltech-UCSD Birds-200, Stanford Cars-196, Stanford Dogs-120, and FGVC-Aircraft, are not included in the evaluation.
2. Some VLMs designed for FGVR are missing from the comparison, such as Finedefics [b] and DeepPerception [c].
3. The technical depth is limited. Although it provides a series of ablation studies, further analysis on the potential reaso
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
