Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

Qingguo Hu; Ante Wang; Jia Song; Delai Qiu; Qingsong Liu; Jinsong Su

arXiv:2508.04453·cs.CV·August 7, 2025

TL;DR

This paper introduces a causality-driven visual object completion task through which LVLMs improve their own visual perception: training instances are constructed automatically, and the model learns by trial and error, leading to significant performance gains.

Contribution

The paper proposes a novel self-improvement framework using causality-driven visual object completion to boost LVLMs' visual knowledge and reasoning capabilities.

Findings

01

Achieved average improvements of 5.4% and 4.0% on specialized tasks with LLaVA models.

02

Demonstrated substantial gains across four challenging tasks and four benchmarks.

03

Utilized automated instance construction without human or GPT-4V assistance.

Abstract

Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g.,…
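
To make the task format concrete, the following is a minimal sketch of what a single CVC-style instance might look like: one object's region is masked, and the model is asked to name it from the remaining context. This is not the authors' pipeline. The names `CVCInstance`, `mask_region`, and `build_instance`, the bounding-box convention, the black fill, and the question wording are all hypothetical, and the sketch assumes the object's box and label are already known; how the paper selects causally related objects is not described in the text above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # RGB triple
Image = List[List[Pixel]]             # rows of pixels (toy stand-in for a real image)
Box = Tuple[int, int, int, int]       # (left, top, right, bottom), right/bottom exclusive


@dataclass
class CVCInstance:
    """Hypothetical container for one masked-object example."""
    masked_image: Image
    question: str
    answer: str


def mask_region(image: Image, box: Box, fill: Pixel = (0, 0, 0)) -> Image:
    """Return a copy of `image` with the pixels inside `box` replaced by `fill`."""
    left, top, right, bottom = box
    masked = [list(row) for row in image]  # copy rows so the input is untouched
    for y in range(max(top, 0), min(bottom, len(masked))):
        for x in range(max(left, 0), min(right, len(masked[y]))):
            masked[y][x] = fill
    return masked


def build_instance(image: Image, box: Box, label: str) -> CVCInstance:
    """Assumed format: hide one labeled object and ask the model to infer it."""
    return CVCInstance(
        masked_image=mask_region(image, box),
        question=(
            "One object in this image is masked. "
            "Based on the visible context, what is the masked object?"
        ),
        answer=label,
    )
```

For example, `build_instance(image, (1, 0, 3, 2), "umbrella")` would hide a 2x2 region in the top rows of `image` and store "umbrella" as the reference answer. A real pipeline would operate on image files and detector outputs rather than nested lists.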
