cVLA: Towards Efficient Camera-Space VLAs

Max Argus; Jelena Bratulic; Houman Masnavi; Maxim Velikanov; Nick Heppert; Abhinav Valada; Thomas Brox

arXiv:2507.02190·cs.RO·December 23, 2025

TL;DR

This paper introduces cVLA, a lightweight Vision-Language-Action model that builds on a VLM to predict robot end-effector trajectory waypoints in image-frame coordinates, and explores depth input to support sim-to-real transfer in robotic manipulation.

Contribution

The paper presents a VLA approach that predicts trajectory waypoints from images instead of low-level controls, which makes the model more efficient to train and robot embodiment agnostic.

Findings

01. Effective sim-to-real transfer demonstrated on real robots.

02. Model trained on simulated data performs well in real-world tasks.

03. Incorporating depth images and decoding strategies improves performance.

Abstract

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and…
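
To make the idea of poses "in image frame coordinates" concrete, the sketch below shows how a waypoint given as a pixel location plus a depth value can be converted to a 3D point in the camera frame and back. It assumes a standard pinhole camera model; the `Intrinsics` class, the function names, and the example intrinsic values are hypothetical and for illustration only, and the abstract does not state that cVLA uses exactly this parameterization.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics (hypothetical container, not from the paper)."""
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point x in pixels
    cy: float  # principal point y in pixels


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Map a pixel (u, v) with depth along the optical axis to a camera-frame 3D point."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def project(point: Tuple[float, float, float], k: Intrinsics) -> Tuple[float, float, float]:
    """Map a camera-frame 3D point to (u, v, depth); the inverse of backproject."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy, z)


# Example intrinsics are assumed values for a 640x480 image.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
waypoint_px = (420.0, 240.0, 2.0)        # (u, v, depth in metres)
waypoint_cam = backproject(*waypoint_px, k)
roundtrip = project(waypoint_cam, k)
```

Under this assumption, a waypoint predicted in the image can be executed by any robot whose pose relative to the camera is known, which is one way to read the "robot embodiment agnostic" claim.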
