HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Zhiying Du; Bei Liu; Yaobo Liang; Yichao Shen; Haidong Cao; Xiangyu Zheng; Zhiyuan Feng; Zuxuan Wu; Jiaolong Yang; Yu-Gang Jiang

arXiv:2512.05693·cs.RO·December 8, 2025

TL;DR

HiMoE-VLA introduces a hierarchical mixture-of-experts architecture for vision-language-action models, effectively managing robotic data heterogeneity to improve generalization and performance across diverse robotic platforms.

Contribution

The paper proposes a novel hierarchical mixture-of-experts architecture specifically designed to handle heterogeneity in robotic demonstration data for vision-language-action models.

Findings

1. Achieves higher accuracy than existing baselines.
2. Demonstrates robust generalization across diverse robots.
3. Shows consistent performance improvements in simulation and real-world tests.

Abstract

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity.…

Code & Models

Datasets

ZhiyingDu/calvin_d_joint (dataset, 614 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning