AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models
Hyeongjun Heo, Seungyeon Woo, Sang Min Kim, Junho Kim, Junho Lee, Yonghyeon Lee, Young Min Kim

TL;DR
This paper introduces a zero-shot camera adaptation method for Vision-Language-Action models that improves viewpoint robustness in robotic manipulation without additional training or data, using real-time view synthesis.
Contribution
It proposes a novel zero-shot framework that virtually adjusts camera views at test time, eliminating the need for fine-tuning or extra data, and enhances model robustness in unstructured environments.
Findings
Outperforms data augmentation baselines on LIBERO benchmark
Improves viewpoint robustness in real-world robotic scenarios
Effective with various camera configurations and movements
Abstract
Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. These fine-tuned models are highly sensitive to camera viewpoint changes that frequently occur in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework without additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real-time. For that, we use a recent feed-forward novel view synthesis model which outputs high-quality target view images, handling both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
