What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

Haoxi Li; Qinglin Hou; Jianfei Ma; Jinxiang Lai; Tao Han; Sikai Bai; Jingcai Guo; Jie Zhang; Song Guo

arXiv:2605.03782·cs.AI·May 6, 2026

TL;DR

GLANCE is a framework for VLM agents that drives curiosity-based exploration from the discrepancy between what the agent predicts and what it actually observes visually, with the aim of improving performance on complex, sparse-reward tasks.

Contribution

GLANCE introduces a curiosity signal derived from the discrepancy between linguistic predictions and visual observations, unifying reasoning and exploration to improve generalization in VLM agents.

Findings

01. GLANCE improves exploration efficiency in sparse-reward environments.

02. Aligning internal world models with visual reality enhances task performance.

03. GLANCE outperforms baseline methods in complex agentic tasks.

Abstract

To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an…
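
The abstract is truncated at this point, so it does not say how GLANCE computes or uses the discrepancy. As a generic illustration only, and not the authors' method, the sketch below shows one way a mismatch between a predicted embedding and an observed embedding could be scored, and one common way to keep a target network slowly "evolving". Every name here (`cosine_discrepancy`, `ema_update`, `decay`) is hypothetical, and both the cosine-distance choice and the moving-average update are assumptions.

```python
import math


def cosine_discrepancy(predicted, observed):
    """Return 1 - cosine similarity between two equal-length vectors.

    Hypothetical helper: `predicted` stands in for an embedding of the
    agent's linguistic prediction, `observed` for a target-network
    embedding of the actual next observation. Both are assumptions.
    """
    if len(predicted) != len(observed):
        raise ValueError("embeddings must have the same length")
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        # A zero vector has no direction; score it as a full mismatch.
        return 1.0
    return 1.0 - dot / (norm_p * norm_o)


def ema_update(target, online, decay=0.99):
    """Move target parameters a small step toward the online parameters.

    An exponential moving average is one common way to obtain a slowly
    changing target network; whether GLANCE uses it is not stated.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


if __name__ == "__main__":
    predicted = [1.0, 0.0]
    print(cosine_discrepancy(predicted, [2.0, 0.0]))  # aligned vectors
    print(cosine_discrepancy(predicted, [0.0, 3.0]))  # orthogonal vectors
    print(ema_update([0.0, 0.0], [1.0, 1.0], decay=0.9))
```

In this sketch an aligned prediction scores near 0 and an orthogonal one scores 1, so larger values mark observations the prediction failed to anticipate.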
