TL;DR
LangForce introduces a Bayesian framework with latent action queries to improve language grounding and out-of-distribution generalization in vision-language-action models for robot manipulation.
Contribution
The paper proposes a Bayesian decomposition approach with learnable latent queries to enhance instruction following and mitigate dataset bias.
Findings
Achieves an 11.3% improvement on out-of-distribution (OOD) benchmarks
Effectively penalizes vision-only shortcuts in policies
Significantly enhances generalization in complex tasks
Abstract
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a…
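The Information Collapse argument can be checked on a toy example. The sketch below is not the paper's code: the function name `conditional_mutual_information` and both toy distributions are assumptions made for illustration. It computes the conditional mutual information I(L; A | V) between instruction L and action A given observation V for a small discrete joint distribution. It compares a "goal-driven" dataset, where the instruction is fully determined by the scene, with a balanced one, where each scene appears with every instruction.

```python
from collections import defaultdict
from math import log2


def conditional_mutual_information(joint):
    """Return I(L; A | V) in bits for a table {(v, l, a): probability}."""
    p_v = defaultdict(float)
    p_vl = defaultdict(float)
    p_va = defaultdict(float)
    # Marginalize the joint table to get p(v), p(v, l) and p(v, a).
    for (v, l, a), p in joint.items():
        p_v[v] += p
        p_vl[(v, l)] += p
        p_va[(v, a)] += p
    total = 0.0
    for (v, l, a), p in joint.items():
        if p > 0.0:
            # I(L; A | V) = sum p(v,l,a) * log[ p(v) p(v,l,a) / (p(v,l) p(v,a)) ]
            total += p * log2(p_v[v] * p / (p_vl[(v, l)] * p_va[(v, a)]))
    return total


# Hypothetical biased dataset: each scene v always comes with the same
# instruction l and the same action a, so l is predictable from v alone.
biased = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Hypothetical balanced dataset: every scene is paired with every
# instruction, and the action follows the instruction (a == l).
balanced = {(v, l, l): 0.25 for v in (0, 1) for l in (0, 1)}

cmi_biased = conditional_mutual_information(biased)
cmi_balanced = conditional_mutual_information(balanced)
print(f"biased:   I(L; A | V) = {cmi_biased:.3f} bits")
print(f"balanced: I(L; A | V) = {cmi_balanced:.3f} bits")
```

In the biased table the quantity is 0 bits, so a policy that sees only V loses nothing by ignoring L. In the balanced table it is 1 bit, so the instruction carries information about the action that the observation does not.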
