OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model
Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis

TL;DR
OG-VLA combines the robustness of 3D-aware robot policies with the generalization strengths of Vision Language Action models (VLAs), improving generalization to unseen environments and instructions in robotic manipulation tasks.
Contribution
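The paper presents OG-VLA, an architecture that unprojects RGBD observations from diverse views into a point cloud, renders that point cloud from canonical orthographic views, and leverages language and vision foundation models to improve the generalization of 3D-aware keyframe policies.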
Findings
Achieves over 40% relative improvement in generalization to unseen environments on benchmarks.
Demonstrates strong real-world generalization with minimal demonstrations.
Maintains robust performance in both seen and unseen environments.
Abstract
We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic…
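The unproject-then-render step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `unproject_depth` and `render_orthographic_top` are hypothetical, and the sketch assumes a pinhole camera model, a 4x4 camera-to-world pose matrix, and a single top-down orthographic view with a simple height-based z-buffer. It uses only the Python standard library so it can be run as-is.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    rgb: Sequence[Sequence[Color]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Tuple[Point, Color]]:
    """Lift every pixel with valid depth to a colored 3D point in the world frame.

    Assumes a pinhole camera (fx, fy, cx, cy) and a row-major 4x4 pose matrix.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # assumption: non-positive depth marks a missing reading
                continue
            # Pinhole back-projection into the camera frame (homogeneous coords).
            pc = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Rigid transform into the shared world frame.
            xw, yw, zw = (
                sum(cam_to_world[r][k] * pc[k] for k in range(4)) for r in range(3)
            )
            points.append(((xw, yw, zw), rgb[v][u]))
    return points


def render_orthographic_top(
    points: Sequence[Tuple[Point, Color]],
    bounds_min: Tuple[float, float],
    bounds_max: Tuple[float, float],
    size: int,
) -> List[List[Color]]:
    """Render colored points to a size x size top-down orthographic image.

    Orthographic projection drops the z coordinate, so pixel position depends
    only on (x, y). The highest point in each cell wins (a simple z-buffer).
    """
    image = [[(0, 0, 0)] * size for _ in range(size)]
    height = [[float("-inf")] * size for _ in range(size)]
    (x0, y0), (x1, y1) = bounds_min, bounds_max
    for (x, y, z), color in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # outside the rendered workspace
        col = int((x - x0) / (x1 - x0) * size)
        row = int((y - y0) / (y1 - y0) * size)
        if z > height[row][col]:
            height[row][col] = z
            image[row][col] = color
    return image
```

Points unprojected from several cameras can be concatenated into one list before rendering, which is what makes the rendered image independent of where the input cameras were placed. Rendering additional canonical views (for example front or side) would follow the same pattern with a different pair of axes kept and a different depth axis for the z-buffer.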
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Diffusion
