What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han, Sikai Bai, Jingcai Guo, Jie Zhang, Song Guo

TL;DR
GLANCE is a framework for VLM agents that treats the discrepancy between the agent's linguistic prediction and the visual input it actually observes as a curiosity signal, driving exploration in complex, sparse-reward tasks.
Contribution
GLANCE introduces a curiosity signal derived from the discrepancy between linguistic predictions and visual observations, unifying reasoning and exploration to improve generalization in VLM agents.
Findings
GLANCE improves exploration efficiency in sparse-reward environments.
Aligning internal world models with visual reality enhances task performance.
GLANCE outperforms baseline methods in complex agentic tasks.
Abstract
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an…
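The abstract is cut off before it specifies how the discrepancy is computed, so the following is only a rough sketch of the general idea, not the authors' method. It assumes that the linguistic prediction and the visual observation are both available as embedding vectors, that the discrepancy is measured as cosine distance, that the "evolving target network" is updated by an exponential moving average (EMA), and that the resulting bonus is added to the environment reward with a weight `beta`. All function names and parameters here are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curiosity_bonus(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Assumed discrepancy measure: cosine distance between the embedding the
    agent predicted from its linguistic reasoning and the embedding of the
    observation it actually received. Larger means more surprising."""
    return 1.0 - cosine_similarity(predicted, observed)


def ema_update(target: Sequence[float], online: Sequence[float],
               decay: float = 0.99) -> List[float]:
    """Assumed update rule for a slowly evolving target network: move each
    target parameter a small step toward the online parameter."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


def shaped_reward(extrinsic: float, predicted: Sequence[float],
                  observed: Sequence[float], beta: float = 0.1) -> float:
    """Environment reward plus a weighted curiosity bonus (beta is hypothetical)."""
    return extrinsic + beta * curiosity_bonus(predicted, observed)
```

In a sparse-reward episode the extrinsic term is zero on most steps, so under these assumptions the bonus is what distinguishes transitions the agent's world model predicted well from those it did not.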
