TL;DR
This paper introduces Residual Semantic Steering (RSS), a probabilistic framework that makes Vision-Language-Action (VLA) models more robust to linguistic perturbations by disentangling semantic intent from visual priors.
Contribution
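The paper proposes RSS, a framework built on two components: Monte Carlo Syntactic Integration, which approximates the semantic posterior via LLM-driven distributional expansion of the instruction, and Residual Affordance Steering, a dual-stream decoding mechanism that isolates the causal influence of language on the predicted action.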
Findings
RSS achieves state-of-the-art robustness across manipulation benchmarks.
It maintains performance under adversarial linguistic perturbations.
Theoretical analysis shows RSS maximizes mutual information between actions and intent.
Abstract
Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual…
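The two mechanisms named in the abstract can be illustrated with a toy sketch. The abstract is cut off before it states exactly what is subtracted, so everything below is an assumption rather than the authors' method: the sketch assumes a discrete action space, assumes the "visual" term is the policy's output with no instruction, and combines the two streams in the style of classifier-free guidance. The names `mc_paraphrase_average`, `residual_steer`, and `gamma` are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def mc_paraphrase_average(logits_per_paraphrase):
    """Monte Carlo estimate of log p(action | intent, image).

    ASSUMPTION: each row holds the policy's action logits for one
    LLM-generated rewording of the same instruction; the estimate is the
    log of the mean action probability across rows.
    """
    logp = log_softmax(logits_per_paraphrase)       # shape (K, A)
    m = logp.max(axis=0)                            # shape (A,)
    return m + np.log(np.exp(logp - m).mean(axis=0))

def residual_steer(cond_logp, visual_only_logits, gamma=1.0):
    """Amplify what language adds on top of a vision-only prior.

    ASSUMPTION: guidance-style rule, not taken from the paper:
    steered = cond + gamma * (cond - visual_prior), then renormalized.
    """
    prior = log_softmax(visual_only_logits)
    return log_softmax(cond_logp + gamma * (cond_logp - prior))

# Toy scene: the visual prior strongly favors action 0, while the
# instruction (in three rewordings) points toward action 1.
visual_only_logits = [2.0, 0.0, 0.0]
paraphrase_logits = [
    [2.0, 1.8, 0.0],
    [2.0, 1.9, 0.0],
    [2.0, 1.7, 0.0],
]

cond_logp = mc_paraphrase_average(paraphrase_logits)
steered = residual_steer(cond_logp, visual_only_logits, gamma=1.0)
print("conditional argmax:", int(np.argmax(cond_logp)))
print("steered argmax:", int(np.argmax(steered)))
```

In this toy case the plain conditional distribution still selects the visually favored action, which mirrors the "modality collapse" the abstract describes, while the steered distribution selects the action that language made more likely relative to the vision-only prior. Setting `gamma=0.0` recovers the unsteered conditional distribution.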
