Stable Language Guidance for Vision-Language-Action Models

Zhihao Zhan; Yuhao Chen; Jiaying Zhou; Qinhan Lyu; Hao Liu; Keze Wang; Liang Lin; Guangrun Wang

arXiv:2601.04052 · cs.RO · April 21, 2026

TL;DR

This paper introduces Residual Semantic Steering (RSS), a novel probabilistic framework that enhances the robustness of vision-language-action models against linguistic perturbations by disentangling semantic intent from visual priors.

Contribution

The paper proposes RSS, a framework that combines two components, Monte Carlo Syntactic Integration and Residual Affordance Steering, to improve semantic understanding and robustness in VLA models.

Findings

01. RSS achieves state-of-the-art robustness across manipulation benchmarks.

02. It maintains performance under adversarial linguistic perturbations.

03. Theoretical analysis shows RSS maximizes mutual information between actions and intent.
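
For readers unfamiliar with the quantity named in the third finding, the standard definition of mutual information is reproduced here. The symbols $A$ (action) and $Z$ (semantic intent) are notation assumed for this note, not necessarily the paper's own, and the paper's analysis may additionally condition on the visual observation.

```latex
I(A; Z) \;=\; \mathbb{E}_{p(a, z)}\!\left[ \log \frac{p(a, z)}{p(a)\, p(z)} \right]
        \;=\; H(A) - H(A \mid Z)
```

A larger $I(A; Z)$ means that knowing the intent removes more uncertainty about the action, which is the opposite of the failure mode in which actions are determined by visual priors alone.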

Abstract

Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual…
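
The abstract is truncated on this page, so the exact formulation of either component is not available here. The following is only a rough sketch of the two ideas as the visible text describes them, under loudly stated assumptions: the Monte Carlo step is taken to be a plain average of predicted actions over paraphrases of one instruction, and the residual step is taken to be a subtraction of a language-free prediction, in the style of classifier-free guidance. The names `toy_policy`, `mc_syntactic_integration`, `residual_steering`, and the `scale` parameter are all hypothetical and are not drawn from the paper or from the Doo-mon/RSS repository.

```python
import numpy as np


def toy_policy(obs, instruction):
    """Hypothetical stand-in for a VLA policy: returns an action vector.

    The visual term depends only on the observation; the language term is a
    small offset that (undesirably) varies with the phrasing of the instruction.
    """
    visual_term = 0.5 * obs
    if not instruction:
        return visual_term  # language-free (vision-only) prediction
    lang_term = np.full_like(obs, 0.1 * len(instruction.split()))
    return visual_term + lang_term


def mc_syntactic_integration(policy, obs, paraphrases):
    """Assumed form: average the predicted action over sampled paraphrases."""
    actions = np.stack([policy(obs, p) for p in paraphrases])
    return actions.mean(axis=0)


def residual_steering(policy, obs, paraphrases, scale=1.0):
    """Assumed form: isolate the language contribution as a residual.

    The residual is the paraphrase-averaged prediction minus the vision-only
    prediction; `scale` > 1 amplifies the language signal relative to the prior.
    """
    base = policy(obs, "")  # prediction with an empty instruction
    conditioned = mc_syntactic_integration(policy, obs, paraphrases)
    residual = conditioned - base
    return base + scale * residual


obs = np.array([1.0, 2.0])
paraphrases = ["pick up the cup", "grasp the cup"]  # in practice, LLM-generated
steered = residual_steering(toy_policy, obs, paraphrases, scale=2.0)
```

With `scale=1.0` the sketch reduces to the paraphrase-averaged prediction; larger values push the action further along the direction attributable to language, which is one plausible way to counter a dominant visual prior.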

Code & Models

Repositories

Doo-mon/RSS (GitHub)
