Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation
Fangyuan Wang, Peng Zhou, Jiaming Qi, Shipeng Lyu, David Navarro-Alarcon, Guodong Guo

TL;DR
ThinkProprio enhances robot manipulation by integrating proprioception early into vision-language models through text tokenization, improving reasoning and efficiency across multiple benchmarks.
Contribution
ThinkProprio encodes proprioception as text tokens and fuses them with the task instruction at the VLM input, so that robot state can shape visual reasoning and token selection while inference latency drops.
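As an illustration of what encoding proprioception as text could look like, the sketch below serializes a state vector into a short string and appends it to the instruction, so an unmodified VLM tokenizer would see both together. The serialization format, the field names, and the helpers `proprio_to_text` and `build_prompt` are assumptions made for this example; the summary does not specify the paper's actual format.

```python
from typing import Sequence


def proprio_to_text(state: Sequence[float], names: Sequence[str],
                    decimals: int = 2) -> str:
    """Serialize a robot state vector as plain text.

    Hypothetical format: "state: name=value ...". The paper's real
    serialization is not given in this summary.
    """
    if len(state) != len(names):
        raise ValueError("state and names must have the same length")
    # Fixed-precision formatting keeps the token count per value predictable.
    parts = [f"{name}={value:.{decimals}f}" for name, value in zip(names, state)]
    return "state: " + " ".join(parts)


def build_prompt(instruction: str, state_text: str) -> str:
    """Fuse instruction and state text at the input (early fusion)."""
    return f"{instruction.strip()}\n{state_text}"


if __name__ == "__main__":
    text = proprio_to_text([0.123, -0.5, 0.0], ["x", "y", "grip"])
    print(build_prompt("pick up the red block", text))
```

Because the state is ordinary text, no learned projector is needed in this sketch: the VLM's existing tokenizer and embedding table handle it like any other part of the prompt.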
Findings
Text tokenization outperforms learned projectors for proprioception encoding.
Retaining roughly 15% of visual tokens can match the performance of the full token set.
Achieves over 50% reduction in inference latency.
Abstract
Vision-language-action (VLA) models typically inject proprioception only as a late conditioning signal, which prevents robot state from shaping instruction understanding and from influencing which visual tokens are attended throughout the policy. We introduce ThinkProprio, which converts proprioception into a sequence of text tokens in the VLM embedding space and fuses them with the task instruction at the input. This early fusion lets embodied state participate in subsequent visual reasoning and token selection, biasing computation toward action-critical evidence while suppressing redundant visual tokens. In a systematic ablation over proprioception encoding, state entry point, and action-head conditioning, we find that text tokenization is more effective than learned projectors, and that retaining roughly 15% of visual tokens can match the performance of using the full token set.…
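The token-selection idea in the abstract can be sketched as a top-k filter: score each visual token against a query vector that stands in for the fused instruction and state representation, then keep only the highest-scoring fraction. The dot-product scoring rule, the function name `keep_top_tokens`, and the use of a single query vector are assumptions for illustration; the abstract does not describe the actual selection mechanism.

```python
import math
from typing import List, Sequence


def keep_top_tokens(visual: Sequence[Sequence[float]],
                    query: Sequence[float],
                    keep_ratio: float = 0.15) -> List[int]:
    """Return indices of the visual tokens to keep, in original order.

    Assumption: relevance is a plain dot product between each visual
    token and one query vector derived from instruction plus state.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token when any are available.
    k = max(1, math.ceil(keep_ratio * len(visual)))
    scores = [sum(a * b for a, b in zip(token, query)) for token in visual]
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    # Restore original order so positional structure is preserved downstream.
    return sorted(ranked[:k])


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    print(keep_top_tokens(tokens, query=[1.0, 0.0], keep_ratio=0.5))
```

Since attention cost grows with sequence length, dropping most visual tokens is one plausible source of the reported latency reduction, though the summary does not break down where the savings come from.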
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
