ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Yuhao Zhou, Yunpeng Zhu, Yang Zhou, Jindi Lyu, Jian Lan, Zhangyuan Wang, Dan Si, Thomas Seidl, Qing Ye, Jiancheng Lyu

TL;DR
ForgeVLA introduces a federated learning framework for vision-language-action models that leverages distributed, unlabeled vision-action data without central aggregation or manual annotations, addressing data heterogeneity and feature collapse.
Contribution
The paper presents a federated training approach for VLA models that reconstructs the missing language modality from vision-action pairs and mitigates feature collapse, all without sharing raw data.
Findings
ForgeVLA outperforms baseline methods on multiple benchmarks.
The method recovers language instructions from vision-action pairs by mapping them to a predefined instruction set.
Ablation studies confirm the importance of each component.
Abstract
Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the…
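The two ideas the abstract names (a per-client classifier that maps vision-action pairs to a predefined instruction set, and training without centralizing raw data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the instruction strings, the linear scoring in `pseudo_label`, and the use of plain FedAvg-style weighted averaging in `fedavg` as a stand-in for whatever aggregation ForgeVLA actually uses.

```python
from typing import List, Sequence

# Hypothetical predefined instruction set (not from the paper).
INSTRUCTION_SET = ["pick up the object", "open the drawer", "push the button"]


def pseudo_label(feature: Sequence[float],
                 weights: Sequence[Sequence[float]]) -> str:
    """Assign an instruction to one vision-action feature vector.

    Assumption: a linear classifier with one weight row per instruction;
    the instruction with the highest score becomes the pseudo-label.
    """
    scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return INSTRUCTION_SET[best]


def fedavg(client_params: List[List[float]],
           client_sizes: List[int]) -> List[float]:
    """Standard FedAvg: average client parameters weighted by local data size.

    Only parameters leave each client; the raw vision-action data stays local.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * p[i] for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Toy usage: one feature vector and two clients with unequal data sizes.
toy_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label = pseudo_label([0.0, 1.0, 0.0], toy_weights)
global_params = fedavg([[0.0, 0.0], [1.0, 1.0]], [1, 3])
```

In this sketch the server only ever sees parameter vectors, which is the property that lets clients keep their vision-action data local; the recovered instruction labels are likewise produced on the client.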
