African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification

Gregor Geigle; Radu Timofte; Goran Glavaš

arXiv:2406.14496·cs.CV·June 21, 2024

TL;DR

This paper introduces FOCI, a challenging benchmark for fine-grained object classification. On it, current LVLMs underperform CLIP models, which highlights the need for better fine-grained alignment in training.

Contribution

The paper creates a new benchmark, FOCI, for evaluating LVLMs on fine-grained classification. It demonstrates a performance gap between LVLMs and CLIP models and emphasizes the need for improved fine-grained training data.

Findings

01

CLIP models outperform LVLMs significantly on FOCI.

02

FOCI tests a skill complementary to those covered by existing benchmarks.

03

LVLMs' image encoders lack fine-grained alignment.

Abstract

Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between animal species), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating FOCI (Fine-grained Object ClassIfication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as an open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public…
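
The negative-mining idea in point (2) of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional embeddings, and the choice of three negatives per question are all assumptions made for illustration, and a real pipeline would take its embeddings from a CLIP model rather than hand-written vectors. The sketch ranks every wrong label by cosine similarity to the image embedding and keeps the most similar ones as distractors.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_multiple_choice(image_emb, label_embs, true_label, num_negatives=3, seed=0):
    """Hypothetical helper: return the true label plus the hardest negatives.

    "Hardest" here means the wrong labels whose embeddings are most similar
    to the image embedding. num_negatives=3 is an assumed default.
    """
    scores = {
        label: cosine(image_emb, emb)
        for label, emb in label_embs.items()
        if label != true_label
    }
    # Keep the wrong labels that score highest against the image.
    negatives = sorted(scores, key=scores.get, reverse=True)[:num_negatives]
    options = negatives + [true_label]
    # Shuffle so the correct answer is not always in the same position.
    random.Random(seed).shuffle(options)
    return options


# Toy embeddings standing in for CLIP text and image features (assumed values).
label_embs = {
    "barn swallow": [0.9, 0.1, 0.0],
    "cliff swallow": [0.8, 0.2, 0.1],
    "tree swallow": [0.7, 0.3, 0.0],
    "house sparrow": [0.2, 0.9, 0.1],
    "airliner": [0.0, 0.1, 0.9],
}
image_emb = [0.85, 0.15, 0.05]

options = build_multiple_choice(image_emb, label_embs, "barn swallow")
print(options)
```

With these toy vectors the easy distractor ("airliner") is dropped and the visually similar species stay in the option set, which is what keeps a multiple-choice question hard.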

Code & Models

Repositories

gregor-ge/foci-benchmark (PyTorch, official)

Videos

African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification (Underline)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training