Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, Yue Wang

TL;DR
This paper introduces BayesVLA, a Bayesian factorization that decomposes a vision-language-action (VLA) policy into a visual-action prior and a language-conditioned likelihood, with the aim of improving out-of-distribution generalization and instruction following.
Contribution
It proposes a Bayesian factorization that addresses modality imbalance in VLA datasets and preserves generalization during fine-tuning, reducing reliance on co-training with external reasoning data and the tuning overhead that comes with it.
Findings
Superior generalization to unseen instructions, objects, and environments.
Effectively mitigates shortcut learning and language forgetting.
Validated through information-theoretic analysis and extensive experiments.
Abstract
The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires experienced tuning and data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and…
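The factorization described in the abstract can be read as an application of Bayes' rule: p(a | v, l) is proportional to p(a | v) · p(l | v, a), where v is the visual observation, l the instruction, and a the action. The following is a minimal sketch of that reading on a toy discrete action set. It is not the authors' implementation: the function name factorized_posterior, the three candidate actions, and all probability values are hypothetical and chosen only for illustration.

```python
import numpy as np

def factorized_posterior(prior, likelihood):
    """Combine a visual-action prior p(a | v) with a language-conditioned
    likelihood p(l | v, a) into a posterior p(a | v, l) via Bayes' rule.

    Both inputs are 1-D arrays indexed by candidate action (an assumption
    of this toy sketch; a real policy would act over continuous actions).
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    unnormalized = prior * likelihood
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("prior and likelihood have no overlapping support")
    # Dividing by the sum plays the role of the evidence term p(l | v).
    return unnormalized / total

# Hypothetical scene with three candidate actions.
prior = [0.5, 0.3, 0.2]       # p(a | v): what the scene alone suggests
likelihood = [0.1, 0.8, 0.1]  # p(l | v, a): how well each action fits the prompt
posterior = factorized_posterior(prior, likelihood)
best_action = int(np.argmax(posterior))
```

In this toy case the prior alone favors the first action, while the instruction shifts most of the probability mass to the second. This mirrors the "seeing to act, prompting to specify" division of labor the abstract describes.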
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Neurobiology of Language and Bilingualism
