The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Yuwen Tan; Yuan Qing; Boqing Gong

arXiv:2505.24840 · cs.CV · March 27, 2026

TL;DR

This paper shows that many open-source LLMs lack hierarchical knowledge of well-established biological taxonomies, which makes them a bottleneck for vision LLMs' hierarchical visual recognition. It also shows that fine-tuning a vision LLM on the paper's VQA tasks improves the LLM's hierarchical consistency more than the vision LLM's.

Contribution

Using about one million four-choice VQA tasks built from six taxonomies and four image datasets, the paper exposes a gap in taxonomy knowledge in the LLMs underlying open-source vision LLMs and traces the vision LLMs' weak hierarchical recognition to that gap.

Findings

01

Many open-source LLMs lack awareness of well-established biological taxonomies.

02

Fine-tuning improves hierarchical consistency.

03

VQA tasks reveal LLMs' bottleneck effect.

Abstract

This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.
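The abstract describes four-choice VQA tasks built from taxonomies and a notion of hierarchical consistency, without giving details here. The following is a minimal sketch of how such a task and score could look, not the authors' implementation. The toy taxonomy, the function names (`path_to_root`, `make_question`, `hierarchical_consistency`), the question wording, and the scoring rule (an image counts as consistent only if every level of its taxonomy path is answered correctly) are all assumptions made for illustration.

```python
import random

# Toy child -> parent map. Hypothetical; not one of the paper's six taxonomies.
PARENT = {
    "Anemone Fish": "Fish",
    "Bald Eagle": "Bird",
    "Honey Bee": "Insect",
    "Garden Snail": "Mollusk",
    "Fish": "Vertebrate",
    "Bird": "Vertebrate",
    "Insect": "Invertebrate",
    "Mollusk": "Invertebrate",
    "Vertebrate": "Animal",
    "Invertebrate": "Animal",
}


def path_to_root(label):
    """Return the taxonomy path for `label`, root first."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def nodes_at_depth(depth):
    """Return all taxonomy nodes that sit `depth` steps below the root."""
    nodes = set(PARENT) | set(PARENT.values())
    return sorted(n for n in nodes if len(path_to_root(n)) - 1 == depth)


def make_question(leaf, depth, rng, num_choices=4):
    """Build one multiple-choice question about `leaf` at a given taxonomy depth.

    Distractors are sampled from other nodes at the same depth (an assumption).
    Fewer than `num_choices` options are returned if the level is too small.
    """
    answer = path_to_root(leaf)[depth]
    distractors = [n for n in nodes_at_depth(depth) if n != answer]
    k = min(num_choices - 1, len(distractors))
    choices = rng.sample(distractors, k) + [answer]
    rng.shuffle(choices)
    return {
        "question": "Which label best describes the image?",  # assumed wording
        "choices": choices,
        "answer": answer,
    }


def hierarchical_consistency(examples):
    """Fraction of examples whose predictions are correct at every level.

    `examples` is a list of (leaf, predictions) pairs, where predictions are
    ordered from depth 1 down to the leaf.
    """
    hits = sum(list(preds) == path_to_root(leaf)[1:] for leaf, preds in examples)
    return hits / len(examples)


rng = random.Random(0)
question = make_question("Anemone Fish", 2, rng)

examples = [
    # Correct at every level: counts as consistent.
    ("Anemone Fish", ["Vertebrate", "Fish", "Anemone Fish"]),
    # Right leaf but wrong ancestor, the failure mode the abstract mentions.
    ("Anemone Fish", ["Invertebrate", "Fish", "Anemone Fish"]),
]
score = hierarchical_consistency(examples)  # 0.5
```

Under this assumed rule, a model that names Anemone Fish correctly but fails on Vertebrate gets no credit for that image, which is why a leaf-level accuracy number alone can hide the gap the paper describes.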

Taxonomy

Topics: Digital Imaging for Blood Diseases · Biomedical Text Mining and Ontologies · Semantic Web and Ontologies