DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Zebin Yang, Yijiahao Qi, Tong Xie, Bo Yu, Shaoshan Liu, Meng Li

TL;DR
DySL-VLA introduces a dynamic layer-skipping framework for vision-language-action models in robotics, significantly reducing computational costs while maintaining high accuracy in manipulation tasks.
Contribution
The paper proposes DySL-VLA, a novel method that dynamically skips less important layers in VLA models based on action importance, improving efficiency without sacrificing accuracy.
Findings
Achieves a 2.1% improvement in success length over Deer-VLA.
Reduces trainable parameters by a factor of 85.7.
Provides a 3.75x speedup at equal accuracy.
Abstract
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping…
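The split between always-executed informative layers and skippable incremental layers can be illustrated with a toy sketch. This is not the authors' implementation. The importance proxy (`discontinuity`, a second difference over the last three actions), the single `threshold`, the identity pass-through for skipped layers, and the toy layers are all assumptions made for illustration. The reviews below indicate that the paper derives importance from trajectory continuity and uses two thresholds, but the exact definitions are not given in this summary.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def discontinuity(actions: Sequence[Vector]) -> float:
    """Assumed importance proxy (hypothetical, not from the paper):
    L2 norm of the second difference of the last three actions.
    Smooth motion gives a value near zero."""
    if len(actions) < 3:
        return math.inf  # too little history: treat the step as important
    a0, a1, a2 = actions[-3], actions[-2], actions[-1]
    return math.sqrt(sum((z - 2 * y + x) ** 2 for x, y, z in zip(a0, a1, a2)))


def forward(
    x: Vector,
    layers: Sequence[Layer],
    informative: Set[int],
    importance: float,
    threshold: float,
) -> Tuple[Vector, List[int]]:
    """Run informative layers always; run incremental layers only when the
    current action is judged important. Returns the output and the indices
    of the layers that were actually executed."""
    run_all = importance >= threshold
    executed: List[int] = []
    for i, layer in enumerate(layers):
        if run_all or i in informative:
            x = layer(x)
            executed.append(i)
        # Otherwise the hidden state passes through unchanged (an assumption).
    return x, executed


def make_layer(shift: float) -> Layer:
    """Toy stand-in for a transformer layer: adds a constant to each element."""
    return lambda v: [e + shift for e in v]


layers = [make_layer(0.1) for _ in range(6)]
informative = {0, 5}  # hypothetical choice of always-executed layers

smooth = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]   # constant-velocity motion
abrupt = [[0.0, 0.0], [0.1, 0.1], [0.5, -0.3]]  # sudden change of direction

_, ran_smooth = forward([0.0], layers, informative, discontinuity(smooth), threshold=0.1)
_, ran_abrupt = forward([0.0], layers, informative, discontinuity(abrupt), threshold=0.1)
print(ran_smooth)
print(ran_abrupt)
```

In this sketch the smooth trajectory executes only the two informative layers, while the abrupt one executes all six, which is the source of the compute savings on less critical steps.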
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper presents a simple algorithm for layer skipping, based on the observation that some layers in the VLA are much more important than others, and that some actions in a trajectory are important and must be processed by the full VLA whereas others are more amenable to layer skipping.
2. The method shows good performance on 2 benchmarks with 2 different VLA architectures.
3. The paper also presents extensive ablations on every part of the method showing individual performance im
***1. Details on the thresholds*** How are $\eta_1$ and $\eta_2$ computed? If they are hyperparameters, how do they change between models and simulators or tasks? In general, the paper lacks detail on how the authors arrived at the hyperparameters used in the experiments, and it would be good to include that.
***2. Real-world experiments?*** The paper does not show any real-world performance. It would be good to see how the method affects inference speed in the real world.
***3. Writing.***
* **Motivation**: The approach tackles the interesting problem of improving VLA model latency and computational cost, which is relevant for robotics.
* **Latency improvements**: Provides speedup and some accuracy improvements, with ablation studies and reproducibility details.
* **Presentation**: The work is clearly presented and motivated. The observations and analysis in Section 3.1 provide useful insights into previous approaches and the proposed method.
* **Generalizability**: The decision to skip layers is based on a non-learned continuity calculation from action outputs, which may not generalize or be optimal.
* **Worsens performance?**: While the authors show improved model latency, it looks like the proposed skipping actually damages performance. It is only a 1% reduction on LIBERO, but the impact on CALVIN is unclear, as the authors do not report OpenVLA-OFT numbers (the base VLA model adopted) in the CALVIN table.
- The paper is well written and presents extensive empirical evidence, such as activation similarity and layer significance in VLA settings, to support the investigation of adaptive layer skipping.
- To the best of my knowledge, the introduced algorithms for pre-skip prediction and two-stage knowledge distillation are novel in the context of VLA training. Both mechanisms are clearly motivated, theoretically sound, and directly address the challenges of inference speed, parameter efficiency, and m
- The method seems slightly convoluted and introduces several interacting components, each with its own set of hyperparameters (e.g. static layer selection, continuity thresholds, trajectory window for continuity calculation, adapter architecture, controller thresholds, moving stride, and number of training steps for each stage). While it was demonstrated to work on two different base VLA architectures and benchmarks, I am doubtful about the ease of adaptability to new tasks o
1. The paper leverages the continuity of the trajectory as an indicator of the importance of current actions, which is intuitive and aligns with robotic motion characteristics.
2. The proposed dynamic-static layer skipping is a well-motivated and technically sound idea to balance accuracy and efficiency by preserving informative layers while skipping redundant ones.
1. The evaluation falls short of demonstrating real-world applicability. Given the claim of improved efficiency for robotic deployment, a real-robot experiment on edge hardware (e.g., Jetson Orin) with latency and performance measurements would strongly support the paper’s claims. The current evaluation is limited to simulations and lacks evidence of generalization to diverse real-world scenarios.
2. Several method components rely on heuristics that may limit generality. For instance, defining s
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robot Manipulation and Learning
