Belief-Aware VLM Model for Human-like Reasoning
Anshul Nayak, Shahil Shaik, Yue Wang

TL;DR
This paper introduces a belief-aware VLM framework that combines retrieval-based memory with reinforcement learning to support human-like reasoning, and reports improvements over zero-shot baselines on VQA tasks.
Contribution
It approximates belief with a vector-based memory that retrieves relevant multimodal context, rather than learning an explicit belief model, and uses reinforcement learning to improve reasoning in vision-language models.
Findings
Achieves consistent improvements over zero-shot baselines on VQA datasets.
Demonstrates the effectiveness of belief-aware reasoning in multimodal tasks.
Abstract
Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models introduce common-sense reasoning through large-scale multimodal pretraining, enabling zero-shot performance across tasks. However, these models still lack explicit mechanisms to represent and update belief, which limits their ability to reason like humans or to capture evolving human intent over long horizons. To address this, we propose a belief-aware VLM framework that integrates retrieval-based memory and reinforcement learning. Instead of learning an explicit belief model, we approximate belief using a vector-based memory that retrieves relevant multimodal context, which is incorporated into the VLM for reasoning. We further refine…
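The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `VectorMemory`, the helper `build_prompt`, the cosine-similarity ranking, and the hand-written three-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from a multimodal encoder, and the reinforcement learning stage mentioned in the abstract is not covered here.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class VectorMemory:
    """Hypothetical belief memory: stores (embedding, context) pairs."""
    entries: list = field(default_factory=list)

    def add(self, embedding, context):
        self.entries.append((list(embedding), context))

    def retrieve(self, query, k=2):
        """Return the k stored contexts most similar to the query embedding."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [context for _, context in ranked[:k]]


def build_prompt(question, retrieved):
    """Prepend retrieved context to the question before it is passed to a VLM."""
    memory_block = "\n".join(f"- {c}" for c in retrieved)
    return f"Relevant past context:\n{memory_block}\nQuestion: {question}"


# Toy embeddings, written by hand purely for illustration.
memory = VectorMemory()
memory.add([1.0, 0.0, 0.0], "person picked up a cup")
memory.add([0.0, 1.0, 0.0], "person opened the fridge")
memory.add([0.9, 0.1, 0.0], "person walked to the sink")

top = memory.retrieve([1.0, 0.05, 0.0], k=2)
prompt = build_prompt("What will the person do next?", top)
print(prompt)
```

The retrieved entries stand in for the belief state: rather than maintaining an explicit belief model, the most relevant past observations are surfaced and handed to the VLM together with the current question.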
