Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Yaqi Zhao; Yuanyang Yin; Lin Li; Mingan Lin; Victor; Shea-Jay Huang; Siwei Chen; Weipeng Chen; Baoqun Yin; Zenan Zhou; and Wentao Zhang

arXiv:2411.16824 · cs.CV · November 27, 2024

Open Access

TL;DR

This paper investigates the cognitive misalignment in large vision-language models caused by visual encoder limitations and proposes an entity-enhanced method to improve alignment and landmark recognition performance.

Contribution

It introduces a multi-granularity supervision approach, Entity-Enhanced Cognitive Alignment (EECA), to generate visually enriched tokens that better align with the LLM's cognitive framework.

Findings

01 VE-Unknown data hampers LVLM understanding

02 Rich visual features in VE-Known data improve alignment

03 EECA significantly boosts landmark recognition accuracy

Abstract

Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components, often using CLIP-ViT as the vision backbone. However, these models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM). Specifically, the VE's representation of visual information may not fully align with the LLM's cognitive framework, leading to a mismatch where visual features exceed the language model's interpretive range. To address this, we investigate how variations in VE representations influence LVLM comprehension, especially when the LLM faces VE-Unknown data: images whose ambiguous visual representations challenge the VE's interpretive precision. Accordingly, we construct a multi-granularity landmark dataset and systematically examine the impact of VE-Known and VE-Unknown data…

Taxonomy

Topics: Semantic Web and Ontologies

Methods: ALIGN