cVLA: Towards Efficient Camera-Space VLAs
Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

TL;DR
This paper introduces cVLA, a lightweight Vision-Language-Action model that builds on VLMs to predict robot trajectory waypoints directly from images, and uses depth data and simulated training data to achieve sim-to-real transfer in robotic manipulation.
Contribution
The paper presents a VLA approach that predicts trajectory waypoints from images rather than low-level controls, which makes it more efficient to train than prior models and agnostic to the robot embodiment.
Findings
Effective sim-to-real transfer is demonstrated on real robots.
The model, trained on simulated data, performs well on real-world tasks.
Incorporating depth images and decoding strategies improves performance.
Abstract
Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision-Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image-frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot-embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and…
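To make the idea of end-effector positions in image-frame coordinates concrete, here is a minimal sketch of how a pixel location plus a depth value maps to a 3D point in the camera frame under a standard pinhole camera model. This is an illustration only, not the paper's implementation: the abstract does not specify how cVLA encodes or decodes waypoints, and the names `Intrinsics`, `backproject`, and `project`, as well as the example intrinsics, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical example type)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> tuple[float, float, float]:
    """Map a pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def project(x: float, y: float, z: float, k: Intrinsics) -> tuple[float, float]:
    """Map a 3D camera-frame point (with z > 0) back to pixel coordinates."""
    u = k.fx * x / z + k.cx
    v = k.fy * y / z + k.cy
    return (u, v)


if __name__ == "__main__":
    # Assumed intrinsics for a 640x480 image; not taken from the paper.
    cam = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    waypoint = backproject(420.0, 240.0, 2.0, cam)
    print(waypoint)
    print(project(*waypoint, cam))
```

Because a waypoint expressed this way depends only on the camera and the scene, not on joint angles or motor commands, a downstream controller for any robot can in principle consume it, which is one way to read the embodiment-agnostic claim in the abstract.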
