StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

Xiaowen Sun; Matthias Kerzel; Mengdi Li; Xufeng Zhao; Paul Striker; Stefan Wermter

arXiv:2605.03927 · cs.CV · May 6, 2026

TL;DR

StateVLM is a vision-language model for robotic affordance reasoning. It is fine-tuned with an Auxiliary Regression Loss (ARL) to improve object localization and object-state understanding.

Contribution

The paper introduces StateVLM, a state-aware vision-language model with a novel training method for numerical reasoning, and provides an open benchmark for object-state affordance reasoning.

Findings

01. The Auxiliary Regression Loss (ARL) improves model performance by 1.6% on adapted benchmarks.

02. StateVLM with ARL achieves 5.2% higher performance on OSAR, the object-state affordance reasoning benchmark.

03. ARL makes affordance reasoning more consistent.

Abstract

Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object…
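The abstract describes ARL only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: a standard sequence-prediction loss is combined with a regression term computed on box coordinates during fine-tuning. The function names, the L1 form of the regression term, the `(x1, y1, x2, y2)` box layout, and the `arl_weight` hyperparameter are all assumptions made for illustration; the source does not specify them.

```python
import math


def token_cross_entropy(logits, target_ids):
    """Mean cross-entropy over a token sequence.

    logits: list of per-step score lists (one list per output token).
    target_ids: list of ground-truth token indices, one per step.
    """
    total = 0.0
    for step_logits, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
    return total / len(target_ids)


def box_l1(pred_box, gt_box):
    """Mean absolute error between two boxes.

    Assumed layout: (x1, y1, x2, y2) in normalized coordinates.
    """
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


def training_loss(logits, target_ids, decoded_box, gt_box, arl_weight=1.0):
    """Sequence loss plus a weighted auxiliary regression term.

    decoded_box stands in for the output of a box decoder; arl_weight is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    seq_loss = token_cross_entropy(logits, target_ids)
    aux_loss = box_l1(decoded_box, gt_box)
    return seq_loss + arl_weight * aux_loss


if __name__ == "__main__":
    # Toy example: two output tokens over a two-token vocabulary.
    logits = [[0.0, 0.0], [0.0, 0.0]]
    targets = [0, 1]
    loss = training_loss(
        logits,
        targets,
        decoded_box=(0.0, 0.0, 1.0, 1.0),
        gt_box=(0.5, 0.0, 1.0, 1.0),
        arl_weight=2.0,
    )
    print(f"total loss: {loss:.4f}")
```

In this sketch the auxiliary term is used only when computing the training loss. At inference, only the token sequence would be decoded, which matches the abstract's statement that standard sequence prediction is preserved.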
