Chatting with Images for Introspective Visual Thinking

Junfei Wu; Jian Guan; Qiang Liu; Shu Wu; Liang Wang; Wei Wu; Tieniu Tan

arXiv:2602.11073 · cs.CV · February 13, 2026

TL;DR

This paper introduces 'chatting with images', a novel framework for visual reasoning that uses language prompts to dynamically re-encode and manipulate multiple image regions, improving multi-image and video reasoning in vision-language models.

Contribution

The paper proposes a new paradigm and model, ViLaVT, that enhances visual reasoning by integrating language-guided feature modulation and dynamic encoding, trained with a curriculum of supervised and reinforcement learning.

Findings

01. ViLaVT outperforms existing models on eight benchmarks.

02. Significant improvements on multi-image and video spatial reasoning tasks.

03. Effective joint re-encoding enhances cross-modal alignment.

Abstract

Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to the loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter…
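
To make "language-guided feature modulation" concrete, here is a minimal sketch. The abstract does not describe how ViLaVT implements it, so this uses a FiLM-style scale-and-shift as a stand-in technique. The names `project` and `film_modulate`, the weight matrices, and the toy dimensions are all hypothetical. The sketch only shows the general idea: one prompt embedding conditions the features of several image regions at once.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # each row is one patch feature


def project(prompt: Vector, weights: Matrix) -> Vector:
    """Linear map: prompt of length p times weights of shape (p, d) gives length d."""
    d = len(weights[0])
    return [sum(prompt[i] * weights[i][j] for i in range(len(prompt)))
            for j in range(d)]


def film_modulate(regions: List[Matrix], prompt: Vector,
                  w_gamma: Matrix, w_beta: Matrix) -> List[Matrix]:
    """FiLM-style stand-in (an assumption, not the paper's method).

    The same prompt-derived scale (gamma) and shift (beta) are applied to
    every patch of every region, so all regions are conditioned jointly.
    """
    gamma = project(prompt, w_gamma)
    beta = project(prompt, w_beta)
    return [
        [[x * (1.0 + g) + b for x, g, b in zip(row, gamma, beta)]
         for row in region]
        for region in regions
    ]


# Toy example: two regions with 2-dimensional patch features.
regions = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
prompt = [1.0, 0.0]
w_gamma = [[1.0, 0.0], [0.0, 0.0]]  # gamma = [1.0, 0.0]
w_beta = [[0.0, 0.5], [0.0, 0.0]]   # beta  = [0.0, 0.5]
out = film_modulate(regions, prompt, w_gamma, w_beta)
```

With all-zero weights the function returns the input features unchanged, so in this sketch the prompt can only add conditioning on top of the original encoding.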

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications