OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model

Ishika Singh; Ankit Goyal; Stan Birchfield; Dieter Fox; Animesh Garg; Valts Blukis

arXiv:2506.01196 · cs.RO · November 19, 2025

TL;DR

OG-VLA combines 3D-aware robot policies with Vision Language Action models (VLAs) to improve generalization to unseen environments and instructions in robotic manipulation tasks.

Contribution

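The paper presents OG-VLA, an architecture that unprojects input observations from diverse views into a point cloud, renders that point cloud from canonical orthographic views, and leverages language and vision foundation models to improve the generalization of 3D-aware keyframe policies.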

Findings

01. Achieves over 40% relative improvement on benchmarks.

02. Demonstrates strong real-world generalization with minimal demonstrations.

03. Maintains robust performance in both seen and unseen environments.

Abstract

We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic…
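The unproject-then-render step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject_rgbd` and `render_orthographic`, the pinhole intrinsics `K`, the cubic workspace `bounds`, the resolution `res`, and the nearest-point splatting are all assumptions chosen to keep the example short.

```python
import numpy as np

def unproject_rgbd(depth, rgb, K, cam_to_world):
    """Lift one RGBD image to a colored point cloud in the world frame.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world
    pose (both hypothetical inputs for this sketch).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column/row grids
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    valid = pts_cam[:, 2] > 0  # drop pixels with no depth reading
    pts_world = (pts_cam[valid] @ cam_to_world.T)[:, :3]
    return pts_world, rgb.reshape(-1, 3)[valid]

def render_orthographic(points, colors, axis, bounds, res):
    """Splat a point cloud onto a res x res image viewed along one world axis.

    The virtual camera sits on the +axis side, so for each pixel the point
    with the largest coordinate along `axis` is kept (a simple z-buffer).
    """
    lo, hi = bounds  # assumed cubic workspace [lo, hi) on every axis
    inside = np.all((points >= lo) & (points < hi), axis=1)
    p, c = points[inside], colors[inside]
    keep = [a for a in range(3) if a != axis]
    ij = ((p[:, keep] - lo) / (hi - lo) * res).astype(int)
    ij = np.clip(ij, 0, res - 1)
    flat = ij[:, 0] * res + ij[:, 1]
    # Sort nearest-first, then keep the first point that lands in each pixel.
    order = np.argsort(-p[:, axis], kind="stable")
    _, first = np.unique(flat[order], return_index=True)
    winners = order[first]
    image = np.zeros((res, res, 3), dtype=c.dtype)
    image.reshape(-1, 3)[flat[winners]] = c[winners]
    return image
```

Under these assumptions, point clouds from several cameras could be concatenated and rendered once per canonical axis, so the resulting images would not depend on where the input cameras were placed.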

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems

Methods: Diffusion