A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu; Bo Wang; Pengpeng Zeng; Haonan Zhang; Ji Zhang; Zheng Wang; Lianli Gao; Jingkuan Song; Nicu Sebe; Heng Tao Shen

arXiv:2510.24795 · cs.CV · February 3, 2026

TL;DR

This survey reviews efficient vision-language-action models (VLAs), organizing recent work into three areas (model design, training, and data collection) to address the computational and data demands of large-scale VLAs and to guide future research in embodied intelligence.

Contribution

The survey introduces a unified taxonomy for efficient VLAs that consolidates previously disparate methods into a single reference for the community.

Findings

01

Categorizes techniques into three core pillars: design, training, and data collection.

02

Summarizes state-of-the-art methods and applications.

03

Identifies key challenges and future research directions.

Abstract

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Despite their remarkable performance, foundational VLAs are hindered by the prohibitive computational and data demands inherent to their large-scale architectures. While a surge of recent research has focused on enhancing VLA efficiency, the field lacks a unified framework to consolidate these disparate advancements. To bridge this gap, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire model-training-data pipeline. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and…
