VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
Changhua Xu, Jie Lu, Junyu Xuan, En Yu

TL;DR
VGAS introduces a value-guided action selection framework for few-shot vision-language-action tasks, improving geometric precision and robustness in scarce data scenarios through a novel critic and regularization techniques.
Contribution
The paper proposes VGAS, an inference-time selection method with a geometric critic and explicit regularization that improves few-shot VLA adaptation performance.
Findings
VGAS improves success rates in limited demonstration settings.
VGAS enhances robustness against distribution shifts.
The geometric critic and regularization stabilize action ranking.
Abstract
Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures often arise from unresolved geometric ambiguities, where near-miss action candidates lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically…
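The generation-selection idea in the abstract can be illustrated with a minimal best-of-N sketch: draw several candidate action chunks, score each with a critic, and execute the top-scoring one. Everything named here (`select_best_chunk`, `toy_proposals`, `toy_critic`, the target vector) is hypothetical and chosen for illustration; the random proposals stand in for a fine-tuned VLA, and the distance-based score stands in for a learned critic. It is not the paper's Q-Chunk-Former or its training procedure.

```python
import random
from typing import Callable, List, Sequence, Tuple

# An action chunk is a short sequence of low-level actions (horizon x action_dim).
ActionChunk = List[List[float]]


def select_best_chunk(
    candidates: Sequence[ActionChunk],
    score_fn: Callable[[ActionChunk], float],
) -> Tuple[int, List[float]]:
    """Score every candidate and return (index of the top-scoring chunk, all scores)."""
    if not candidates:
        raise ValueError("need at least one candidate chunk")
    scores = [score_fn(chunk) for chunk in candidates]
    # On ties, max() keeps the earliest candidate.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


def toy_proposals(
    n: int, horizon: int, action_dim: int, rng: random.Random
) -> List[ActionChunk]:
    """Assumption: uniform random chunks stand in for samples from a VLA policy."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
        for _ in range(n)
    ]


def toy_critic(chunk: ActionChunk, target: Sequence[float]) -> float:
    """Assumption: negative squared distance of the final action to a target
    stands in for a learned value estimate (higher is better)."""
    final_action = chunk[-1]
    return -sum((a - t) ** 2 for a, t in zip(final_action, target))


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = toy_proposals(n=8, horizon=4, action_dim=2, rng=rng)
    best, scores = select_best_chunk(
        candidates, lambda chunk: toy_critic(chunk, [0.5, -0.2])
    )
    print(f"selected candidate {best} with score {scores[best]:.3f}")
```

In this sketch the quality of the final choice depends entirely on how well the scoring function ranks near-miss candidates, which is the role the abstract assigns to the critic.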
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
