KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
Zihao Zheng, Zhihao Mao, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Donggang Cao, Hong Mei, Xiang Chen

TL;DR
KERV is a kinematic-rectified speculative decoding framework that speeds up inference in vision-language-action (VLA) models by incorporating robotic kinematics, cutting computational cost while maintaining high success rates.
Contribution
The paper proposes KERV, a framework that combines kinematic-domain prediction with speculative decoding to speed up inference in embodied VLA models while preserving task success rates.
Findings
Achieves 27% to 37% acceleration in inference speed.
Incurs almost no loss in success rate across tasks.
Effectively reduces the need for costly re-inference.
Abstract
Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address the above two issues effectively. Meanwhile, as the bridge between AI and the physical world, existing embodied intelligence has overlooked the application of robotic kinematics. To address these issues, we innovatively combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD…
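As a rough illustration of the idea in the abstract, the sketch below shows how a kinematics-based Kalman filter could stand in for re-inference when a drafted action is rejected. The abstract is truncated before it describes the compensation step, so everything here is an assumption rather than the authors' method: the class `ConstantVelocityKF`, the function `rectify_draft`, the constant-velocity motion model, the noise values, the scalar (single-dimension, continuous) actions, and the absolute-difference acceptance test are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter for one action dimension.

    Hypothetical helper: the motion model and noise values are assumptions.
    """
    x: float = 0.0    # position estimate
    v: float = 0.0    # velocity estimate
    p00: float = 1.0  # covariance entries (symmetric 2x2 matrix)
    p01: float = 0.0
    p11: float = 1.0
    q: float = 1e-3   # process noise (assumed)
    r: float = 1e-2   # measurement noise (assumed)
    dt: float = 1.0   # time step between actions (assumed)

    def predict(self) -> float:
        """Propagate the state one step: x' = x + v*dt, P' = F P F^T + Q."""
        self.x += self.v * self.dt
        p00 = self.p00 + 2 * self.dt * self.p01 + self.dt ** 2 * self.p11 + self.q
        p01 = self.p01 + self.dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p11 = p00, p01, p11
        return self.x

    def update(self, z: float) -> None:
        """Correct the state with an observed position z (H = [1, 0])."""
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p01 / s
        y = z - self.x
        self.x += k0 * y
        self.v += k1 * y
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p11 = p00, p01, p11


def rectify_draft(draft_actions, verify_actions, kf, threshold):
    """Accept drafted actions that lie within `threshold` of the verifier's action.

    On rejection, substitute the filter's kinematic prediction instead of
    re-running the large model (an assumed form of compensation).
    """
    out = []
    for drafted, verified in zip(draft_actions, verify_actions):
        predicted = kf.predict()
        if abs(drafted - verified) <= threshold:
            kf.update(drafted)     # accepted action refines the filter
            out.append(drafted)
        else:
            out.append(predicted)  # rejected action replaced by the prediction
    return out
```

Calling `rectify_draft([0.1, 0.2, 0.3, 0.9, 0.5], [0.1, 0.2, 0.3, 0.4, 0.5], ConstantVelocityKF(), 0.05)` keeps the four drafted actions that agree with the verifier and replaces the outlier 0.9 with a value extrapolated from the preceding motion. The threshold in this sketch plays the role of the acceptance threshold that the abstract says requires careful adjustment.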
