African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification
Gregor Geigle, Radu Timofte, Goran Glavaš

TL;DR
This paper introduces FOCI, a challenging multiple-choice benchmark for fine-grained object classification. Current LVLMs underperform CLIP models on it, which points to a need for better fine-grained alignment in training.
Contribution
The paper creates a new benchmark, FOCI, for evaluating LVLMs on fine-grained classification, and demonstrates the performance gap with CLIP models, emphasizing the need for improved fine-grained training data.
Findings
CLIP models outperform LVLMs significantly on FOCI.
FOCI tests for a complementary skill to existing benchmarks.
LVLMs' image encoders lack fine-grained alignment.
Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (Fine-grained Object ClassIfication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public…
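The negative-mining step in point (2) can be illustrated with a minimal sketch. It assumes the image-to-label similarity scores have already been computed by some CLIP model. The function name `build_multiple_choice`, the default of three negatives, and the example scores are all hypothetical, since the excerpt above does not give the paper's exact settings.

```python
import random


def build_multiple_choice(true_label, similarity, num_negatives=3, seed=0):
    """Build one multiple-choice item from precomputed similarity scores.

    similarity maps each candidate label to an image-text similarity score
    (assumed to come from a CLIP model; computing it is out of scope here).
    Returns (options, answer_index).
    """
    # Every label other than the ground truth is a candidate distractor.
    candidates = [label for label in similarity if label != true_label]
    if len(candidates) < num_negatives:
        raise ValueError("not enough candidate labels to mine negatives from")

    # Hard negatives: the wrong labels the scoring model rates as most similar.
    negatives = sorted(candidates, key=lambda label: similarity[label], reverse=True)
    negatives = negatives[:num_negatives]

    # Shuffle so the correct answer does not sit at a fixed position.
    options = negatives + [true_label]
    random.Random(seed).shuffle(options)
    return options, options.index(true_label)


# Made-up scores for a single image whose true label is "barn swallow".
similarity = {
    "barn swallow": 0.31,
    "cliff swallow": 0.29,
    "tree swallow": 0.28,
    "purple martin": 0.27,
    "bald eagle": 0.12,
    "mallard": 0.08,
}
options, answer_index = build_multiple_choice("barn swallow", similarity)
```

Selecting the highest-scoring wrong labels keeps the distractors visually plausible (other swallows rather than eagles), which is what preserves the difficulty of the original classification task in a multiple-choice format.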
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
