LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian; Bin Yu; Xiaopeng Lin; Laurence T. Yang; Zhaolong Shen; Changti Wu; Yuzhuo Miao; Cong Huang; Kai Chen

arXiv:2601.15197·cs.AI·May 14, 2026

1 Repo 2 Models

TL;DR

LangForce introduces a Bayesian framework with latent action queries to improve language grounding and out-of-distribution generalization in vision-language-action models for robot manipulation.

Contribution

The paper proposes a novel Bayesian decomposition approach with learnable latent queries to enhance instruction following and mitigate dataset bias.

Findings

01. Achieves 11.3% improvement on OOD benchmarks

02. Effectively penalizes vision-only shortcuts in policies

03. Significantly enhances generalization in complex tasks

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a…
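The Information Collapse argument in the abstract rests on a standard bound: the conditional mutual information I(L; A | V) between instruction L and action A given observation V can never exceed the conditional entropy H(L | V). If the instruction is fully predictable from the scene, H(L | V) is zero and so is the language signal available to the policy. The sketch below checks this numerically on two toy joint distributions. It is an illustration only, not the paper's code: the scene, instruction, and action labels, the probabilities, and the helper names `marginal`, `entropy_l_given_v`, and `cmi_l_a_given_v` are all hypothetical.

```python
from collections import defaultdict
from math import log2


def marginal(joint, keep):
    """Sum a joint table {(v, l, a): prob} down to the index positions in `keep`."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in keep)] += p
    return out


def entropy_l_given_v(joint):
    """H(L | V) in bits: how unpredictable the instruction is given the scene."""
    p_v = marginal(joint, (0,))
    p_vl = marginal(joint, (0, 1))
    return -sum(p * log2(p / p_v[(v,)]) for (v, l), p in p_vl.items() if p > 0)


def cmi_l_a_given_v(joint):
    """I(L; A | V) in bits: what the instruction says about the action beyond the scene."""
    p_v = marginal(joint, (0,))
    p_vl = marginal(joint, (0, 1))
    p_va = marginal(joint, (0, 2))
    total = 0.0
    for (v, l, a), p in joint.items():
        if p > 0:
            total += p * log2(p * p_v[(v,)] / (p_vl[(v, l)] * p_va[(v, a)]))
    return total


# Hypothetical toy data. In both tables the action is fully determined by the instruction.
TASKS = (("open drawer", "pull"), ("pick cup", "grasp"))

# Goal-driven collection: each scene appears with exactly one instruction.
biased = {
    ("drawer_scene", "open drawer", "pull"): 0.5,
    ("cup_scene", "pick cup", "grasp"): 0.5,
}

# Balanced collection: every scene appears with both instructions.
balanced = {
    (scene, instr, act): 0.25
    for scene in ("drawer_scene", "cup_scene")
    for instr, act in TASKS
}

for name, table in (("biased", biased), ("balanced", balanced)):
    print(name, "H(L|V) =", entropy_l_given_v(table), "I(L;A|V) =", cmi_l_a_given_v(table))
```

In the biased table both quantities are zero bits, so a vision-only policy loses nothing by ignoring the instruction. In the balanced table both equal one bit, so the instruction is the only way to choose between the two actions in a given scene.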

Code & Models

Repositories

zgc-embodyai/LangForce (GitHub)

Models
