Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

TL;DR
This paper introduces a causality-driven visual object completion task to strengthen LVLMs' visual perception. The model self-improves through trial-and-error learning on automatically constructed instances, yielding significant performance gains.
Contribution
The paper proposes a novel self-improvement framework using causality-driven visual object completion to boost LVLMs' visual knowledge and reasoning capabilities.
Findings
Achieved average improvements of 5.4% and 4.0% on specialized tasks with LLaVA models.
Demonstrated substantial gains across four challenging tasks and four benchmarks.
Utilized automated instance construction without human or GPT-4V assistance.
Abstract
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g.,…
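To make the task concrete, the following is a minimal sketch of what a CVC-style instance might look like: an object region is masked out, and the model is asked to name the hidden object. It is an illustration only, not the paper's pipeline, whose details are not given in this summary. The names `CVCInstance`, `mask_object`, `build_instance`, and `keep_correct_trials`, the rectangular mask, the question wording, and the exact-match check are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
Image = List[List[Pixel]]             # rows of pixels
Box = Tuple[int, int, int, int]       # (left, top, right, bottom), right/bottom exclusive


@dataclass
class CVCInstance:
    """Hypothetical container for one masked-object completion example."""
    masked_image: Image
    question: str
    answer: str


def mask_object(image: Image, box: Box, fill: Pixel = (0, 0, 0)) -> Image:
    """Return a copy of `image` with every pixel inside `box` replaced by `fill`."""
    left, top, right, bottom = box
    return [
        [fill if (top <= y < bottom and left <= x < right) else px
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


def build_instance(image: Image, box: Box, label: str) -> CVCInstance:
    """Assumed construction step: hide the labelled object and ask for it."""
    return CVCInstance(
        masked_image=mask_object(image, box),
        question="Which object is hidden by the masked region?",
        answer=label,
    )


def keep_correct_trials(trials: List[str], instance: CVCInstance) -> List[str]:
    """Assumed trial-and-error filter: keep sampled answers that match the label."""
    target = instance.answer.strip().lower()
    return [t for t in trials if t.strip().lower() == target]


# Usage: a 4x3 white image with a 2x2 object region masked out.
image = [[(255, 255, 255)] * 4 for _ in range(3)]
instance = build_instance(image, box=(1, 0, 3, 2), label="umbrella")
kept = keep_correct_trials(["hat", " Umbrella ", "kite"], instance)
```

In this sketch the original image is left untouched and only the copy is masked, so the unmasked version stays available as the source of the ground-truth label.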
