HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies
Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang

TL;DR
HiMoE-VLA introduces a hierarchical mixture-of-experts architecture for vision-language-action models, effectively managing robotic data heterogeneity to improve generalization and performance across diverse robotic platforms.
Contribution
The paper proposes a novel hierarchical mixture-of-experts architecture specifically designed to handle heterogeneity in robotic demonstration data for vision-language-action models.
Findings
Achieves higher accuracy than existing baselines.
Demonstrates robust generalization across diverse robots.
Shows consistent performance improvements in simulation and real-world tests.
Abstract
The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning
