VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Chenyu Zhou; Mengdan Zhang; Peixian Chen; Chaoyou Fu; Yunhang Shen; Xiawu Zheng; Xing Sun; Rongrong Ji

arXiv:2406.10228·cs.CV·June 17, 2024·1 cites

Open Access

TL;DR

This paper introduces the Interleaved Image-Text Comprehension (IITC) task and the VEGA dataset to evaluate and enhance vision-language models' ability to handle complex, misleading, and interleaved visual and textual information.

Contribution

The paper presents a new challenging IITC task, a specialized VEGA dataset for scientific content, and a multi-task training strategy to improve models' nuanced image-text comprehension capabilities.

Findings

01. Even top models like GPT4V achieve modest success on IITC.

02. Multi-task, multi-scale post-training improves image association accuracy to 85.8%.

03. The VEGA dataset effectively benchmarks and enhances MLLMs' comprehension skills.
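
As a rough illustration of what an image association accuracy figure measures, the sketch below computes the fraction of questions for which a model points to the same image as the reference annotation. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `image_association_accuracy`, the representation of each prediction as an integer image index, and the toy data are all hypothetical.

```python
from typing import Sequence


def image_association_accuracy(
    predicted: Sequence[int], reference: Sequence[int]
) -> float:
    """Return the fraction of examples whose predicted image index
    matches the reference image index.

    Assumption (not from the paper): each example is reduced to a single
    integer identifying the image the answer relies on.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one example is required")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


# Hypothetical toy data: four questions, one image index per question.
predicted = [0, 2, 1, 3]
reference = [0, 2, 2, 3]
print(image_association_accuracy(predicted, reference))  # prints 0.75
```

With three of the four toy predictions matching, the score is 0.75; a figure such as 85.8% would correspond to the same ratio over a full test set, assuming the metric is defined this way.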

Abstract

The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devised a subtask,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training · VEGA