Chatting with Images for Introspective Visual Thinking
Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

TL;DR
This paper introduces 'chatting with images', a novel framework for visual reasoning that uses language prompts to dynamically re-encode and manipulate multiple image regions, improving multi-image and video reasoning in vision-language models.
Contribution
The paper proposes a new paradigm and model, ViLaVT, that enhances visual reasoning by integrating language-guided feature modulation and dynamic encoding, trained with a curriculum of supervised and reinforcement learning.
Findings
ViLaVT outperforms existing models on eight benchmarks.
The gains are most significant on multi-image and video spatial reasoning tasks.
Language-guided joint re-encoding of multiple image regions enhances cross-modal alignment.
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
