StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter

TL;DR
StateVLM is a vision-language model for robotic affordance reasoning. It is trained with a new strategy that adds an auxiliary regression loss to improve object localization and object-state understanding.
Contribution
The paper introduces StateVLM, a state-aware vision-language model with a novel training method for numerical reasoning, and provides an open benchmark for object-state affordance reasoning.
Findings
The Auxiliary Regression Loss (ARL) improves model performance by 1.6% on adapted benchmarks.
StateVLM with ARL achieves 5.2% higher performance on OSAR.
ARL enhances consistency in affordance reasoning tasks.
Abstract
Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object…
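The training strategy described in the abstract, a sequence-prediction loss plus an auxiliary regression term computed from box decoder outputs, can be illustrated with a minimal sketch. The abstract does not state the form of the regression loss or how the two terms are combined, so the L1 box loss, the additive weighting, and every name below (`cross_entropy`, `l1_box_loss`, `training_loss`, `arl_weight`) are assumptions for illustration, not the authors' implementation.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy over a sequence of logit vectors."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def l1_box_loss(pred_box, gt_box):
    """Mean absolute error between two boxes given as (x1, y1, x2, y2).

    Assumption: L1 is used here only as a stand-in regression loss.
    """
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)


def training_loss(logits, target_ids, pred_box, gt_box, arl_weight=1.0):
    """Sequence loss plus a weighted auxiliary regression term.

    Assumption: the two terms are summed with a hypothetical weight.
    The regression term is used during fine-tuning only; at inference
    the model would decode the token sequence as usual.
    """
    seq_loss = cross_entropy(logits, target_ids)
    aux_loss = l1_box_loss(pred_box, gt_box)
    return seq_loss + arl_weight * aux_loss
```

In this sketch, setting `arl_weight=0.0` recovers plain sequence-prediction training, which matches the abstract's statement that standard sequence prediction is preserved at inference.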
