Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Cong Pang; Hongtao Yu; Zixuan Chen; Lewei Lu; Xin Lou

arXiv:2512.10384·cs.CV·December 12, 2025

TL;DR

This paper introduces a new benchmark and optimization strategies for large vision-language models to enhance fine-grained recognition, addressing a gap in existing evaluation methods and demonstrating significant performance improvements.

Contribution

The paper presents the FROW benchmark for detailed evaluation of LVLMs and proposes data construction and training strategies to improve fine-grained recognition performance.

Findings

01. Mosaic data improves category recognition accuracy by 1%.

02. Open-world data boosts FROW accuracy by 10%-20%.

03. Fine-grained data in pre-training increases recognition accuracy by up to 10%.

Abstract

Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks and often neglect fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. Building on this benchmark, we propose a novel optimization strategy from two perspectives, data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments…
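
The abstract describes mosaic data only as something that "combines multiple short-answer responses". As a rough illustration of that idea, and not the authors' actual pipeline, the sketch below merges several short-answer items into one multi-part sample. The `ShortAnswerSample` schema, the `build_mosaic_sample` function, the prompt layout, and the example records are all hypothetical; the paper may combine images and answers quite differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShortAnswerSample:
    """One recognition item with a short answer (hypothetical schema)."""
    image_id: str
    question: str
    answer: str


def build_mosaic_sample(samples: List[ShortAnswerSample]) -> dict:
    """Merge several short-answer items into one multi-part sample.

    Assumption: "mosaic" is read here as concatenating numbered questions
    and their numbered short answers; this is not confirmed by the source.
    """
    if not samples:
        raise ValueError("need at least one sample")
    # Number each question and tag it with the image it refers to.
    prompt_lines = [
        f"({i}) [{s.image_id}] {s.question}"
        for i, s in enumerate(samples, start=1)
    ]
    # Number each answer the same way so parts line up with questions.
    answer_lines = [f"({i}) {s.answer}" for i, s in enumerate(samples, start=1)]
    return {
        "image_ids": [s.image_id for s in samples],
        "prompt": "\n".join(prompt_lines),
        "response": "\n".join(answer_lines),
    }


# Made-up records, for illustration only.
example = build_mosaic_sample([
    ShortAnswerSample("img_a", "What species is this bird?", "Indigo bunting"),
    ShortAnswerSample("img_b", "What model is this car?", "Audi A4"),
])
```

Keeping the numbering identical in `prompt` and `response` is one simple way to let a single training target cover several fine-grained labels at once.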

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques