CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li

TL;DR
This paper introduces CEED-VLA, a novel vision-language-action model that employs consistency distillation and early-exit decoding to significantly accelerate inference in robotic tasks without sacrificing performance.
Contribution
It proposes a new acceleration framework combining consistency distillation and early-exit decoding for VLA models in robotics, improving inference speed by over 4 times.
Findings
Achieves over 4x inference speedup across various benchmarks.
Maintains high task success rates in simulated and real-world robot tasks.
Provides a general paradigm for efficient multimodal decision-making in robotics.
Abstract
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal because it still requires many iterations. To address this, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations…
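The decoding scheme the abstract refers to can be pictured with a toy example. The sketch below is not the authors' implementation: `toy_next_token` is a hypothetical stand-in for one greedy decoder step, and `exit_iter` is an illustrative name for a fixed early-exit point of the kind one review calls $\sigma$. It only shows how Jacobi decoding refines all positions of an action chunk at once, and how stopping after a fixed number of iterations trades exactness for speed.

```python
from typing import Callable, List, Optional, Tuple

Step = Callable[[List[int]], int]


def toy_next_token(context: List[int]) -> int:
    """Hypothetical greedy decoder step (an assumption, not the paper's model)."""
    return (31 * sum(context) + 7 * len(context)) % 256


def autoregressive_decode(prefix: List[int], n: int,
                          step: Step = toy_next_token) -> List[int]:
    """Baseline: produce n tokens one at a time, each conditioned on the last."""
    out: List[int] = []
    for _ in range(n):
        out.append(step(prefix + out))
    return out


def jacobi_decode(prefix: List[int], n: int,
                  exit_iter: Optional[int] = None,
                  step: Step = toy_next_token) -> Tuple[List[int], int, bool]:
    """Jacobi decoding with an optional fixed early exit.

    Returns (tokens, iterations_used, converged).
    """
    guess = [0] * n  # arbitrary initial guess for all n positions
    # Without early exit, n + 1 iterations always suffice to detect the fixed point.
    limit = n + 1 if exit_iter is None else exit_iter
    for it in range(1, limit + 1):
        # Every position is refreshed from the previous iterate; in a real
        # model this would be a single batched forward pass.
        new = [step(prefix + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached: identical to greedy AR output
            return new, it, True
        guess = new
    return guess, limit, False  # early exit: tokens may not be fully converged
```

With an untrained toy step like this one, each iteration is only guaranteed to fix one more leading token, which mirrors the abstract's point that plain Jacobi decoding gains little; the distillation described there aims to make several tokens correct per iteration so the fixed point is reached sooner.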
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- **Well-scoped training recipe.** Consistency loss + mixed AR loss with an explicit switch by δmax; concise algorithmic summary (Alg. 1).
- **Early-exit decoding matters in practice.** Identifies long-tail “redundant iterations”; early-exit improves **min** and **avg** speed (Table 1).
- **Compelling empirical gains.** 2.0× on CALVIN (tokens/s 79.2 vs 39.6) and 4.1× on LIBERO-Long; frequency up to 25.6 Hz with OpenVLA (Table 2).
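To make the "consistency loss + mixed AR loss" recipe in this review concrete, here is a minimal numeric sketch. It assumes the KL form that the next review comment attributes to Eq. (3): a stop-gradient teacher distribution conditioned on the fixed-point tokens, compared against the student distribution conditioned on an intermediate Jacobi state. The function names and the `ar_weight` parameter are hypothetical, the logits are passed in rather than computed by a model, and the δmax switch is deliberately not modeled because this excerpt does not specify how it operates.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi) - math.log(max(qi, eps)))
               for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(teacher_logits: Sequence[Sequence[float]],
                     student_logits: Sequence[Sequence[float]]) -> float:
    """Mean per-token KL(teacher || student).

    teacher_logits: assumed to come from a stop-gradient copy of the model
                    conditioned on the fixed-point tokens.
    student_logits: assumed to come from the model conditioned on an
                    intermediate Jacobi state.
    """
    terms = [kl_divergence(softmax(t), softmax(s))
             for t, s in zip(teacher_logits, student_logits)]
    return sum(terms) / len(terms)


def ar_loss(student_logits: Sequence[Sequence[float]],
            labels: Sequence[int]) -> float:
    """Mean cross-entropy of the student against the supervision labels."""
    terms = [-math.log(softmax(s)[y]) for s, y in zip(student_logits, labels)]
    return sum(terms) / len(terms)


def total_loss(teacher_logits: Sequence[Sequence[float]],
               student_logits: Sequence[Sequence[float]],
               labels: Sequence[int],
               ar_weight: float = 1.0) -> float:
    """Consistency term plus a weighted AR term (weighting is an assumption)."""
    return (consistency_loss(teacher_logits, student_logits)
            + ar_weight * ar_loss(student_logits, labels))
```

In a real training loop the teacher logits would be detached from the gradient graph; here that is implicit because plain floats carry no gradients.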
1) **Loss formulation clarity.** Eq. (3) appears to optimize a KL between *Q*θ−(· | Y\*_i, s, l) and *Q*θ(· | Y_i, s, l), i.e., the numerator is conditioned on the **fixed-point token** Y\*_i while the denominator is conditioned on the **intermediate token** Y_i. Please supply a derivation that this realizes “map any Y on J to Y\*” rather than merely aligning two different conditionals; include why the stop-gradient teacher is the same student and whether this collapses to self-distillation wi
The acceleration results are quantitatively significant and consistently shown on two base models (LLaVA-VLA and OpenVLA).
- Lack of discussion on the necessity of autoregression. The paper’s motivation assumes the continued importance of AR decoding, but this assumption is not well-justified in the introduction. Since many modern heads (bi-directional attention, diffusion, or flow matching) are inherently parallel, the authors should explicitly discuss when and why AR decoding remains necessary for manipulation models.
- Choice of baselines and comparison. The experiments are conducted on LLaVA-VLA and a chunk-bas
1. The motivation to accelerate RT-2 style VLA is very reasonable. The motivation is quite clear.
2. Many ablations are included, and the design choices are verified in the experiments.
1. Although the motivations of this paper are reasonable, the techniques employed are largely based on existing methods. The Jacobi decoding strategy was originally proposed for language models (https://arxiv.org/pdf/2403.00835), and many implementation details appear to be directly borrowed from that work. The primary contribution of this paper, therefore, lies in adapting an existing LLM acceleration method to the VLA setting.
2. Several existing works have explored alternative approaches to
1. The motivation is clear, and the idea is straightforward and reasonable;
2. The experiments show good speedup;
3. The ablation study is clear, showing the effects of different proposed components.
1. A question about the early exiting strategy: it seems that an exiting point $\sigma$ is predetermined and fixed before inference, and this exiting point decides a trade-off between performance and efficiency.
   - From my viewpoint, early exiting could be realized as a data-dependent strategy, which has been extensively studied in dynamic neural networks [1], including the fields of computer vision [2,3], LLM [4], or VLA [5].
   - Therefore, I wonder about this design choice of the author
1. The paper applies consistency learning to VLAs, making them predict robot actions faster without changing the overall design.
2. It directly addresses the slow inference problem of VLAs and shows up to four times faster performance with minimal accuracy loss.
3. The experiments are thorough, including both simulated and real robot tests.
1. The central limitation of this paper lies in how it positions CEED-VLA within the broader landscape of non-autoregressive consistency-based policy generation, especially recent approaches that use diffusion or flow-matching architectures, such as Pi0.5, ManiCM, or FlowPolicy. These works have already demonstrated that consistency-style distillation can accelerate robot action inference by learning direct mappings from noisy or intermediate latent states to final actions. For example, FlowPoli
Taxonomy
Topics: Topic Modeling
