Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models

Jiaying Lu; Jinmeng Rao; Kezhen Chen; Xiaoyuan Guo; Yawen Zhang; Baochen Sun; Carl Yang; Jie Yang

arXiv:2309.04041·cs.CV·January 17, 2024·6 cites

Open Access

TL;DR

This paper evaluates the semantic grounding ability of large vision-language models, identifies prevalent misgrounding issues, and proposes a data-centric tuning method to improve how they connect language to real-world entities.

Contribution

The paper introduces a comprehensive evaluation pipeline for semantic grounding and proposes a novel multimodal instruction tuning approach to enhance LVLMs' grounding capabilities.

Findings

01

Identified widespread misgrounding in LVLMs across various semantic aspects.

02

Developed a large-scale dataset for evaluating semantic grounding.

03

Enhanced LVLMs show significant improvements in grounding accuracy.

Abstract

Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is their constrained semantic grounding ability, which pertains to connecting language to the physical-world entities or concepts referenced in images. Therefore, a crucial need arises for a comprehensive study to assess the semantic grounding ability of widely used LVLMs. Despite the significance, sufficient investigation in this direction is currently lacking. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, material, etc., along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent…

Taxonomy

Topics: Multimodal Machine Learning Applications
