ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang, Hengtao Shen

TL;DR
ActDistill introduces an action-guided self-distillation framework that cuts computation in vision-language-action models by over 50% and delivers up to a 1.67× speedup while maintaining or improving benchmark performance, enabling more efficient robotic manipulation.
Contribution
ActDistill proposes a graph-structured distillation method guided by action priors, yielding lightweight VLA models with minimal performance loss.
Findings
Reduces computation by over 50%
Achieves up to a 1.67× speedup
Maintains or improves performance on benchmarks
Abstract
Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher,…
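The abstract excerpt does not state the training objective, so the following is only a generic sketch of teacher-to-student action distillation, not ActDistill's graph-structured method. It assumes the student is trained to match both the teacher's predicted action and a ground-truth action; the function names, the mean-squared-error loss, and the `alpha` weighting are all hypothetical choices made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length action vectors."""
    if len(a) != len(b):
        raise ValueError("action vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_action, teacher_action, target_action, alpha=0.5):
    """Blend imitation of the teacher with a supervised task loss.

    alpha is a hypothetical weighting, not a value taken from the paper.
    """
    distill = mse(student_action, teacher_action)  # match the teacher's prediction
    task = mse(student_action, target_action)      # match the ground-truth action
    return alpha * distill + (1.0 - alpha) * task


# Toy 3-dimensional action vectors (made-up numbers, for illustration only).
teacher = [0.10, -0.20, 0.30]
target = [0.12, -0.18, 0.33]
student = [0.00, 0.00, 0.00]
loss = distillation_loss(student, teacher, target)
```

The loss is zero only when the student reproduces both the teacher's action and the target, and `alpha` trades off how strongly the student follows the teacher versus the labels.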
