VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving

Hyunki Seong; Seongwoo Moon; Hojin Ahn; Jehun Kang; David Hyunchul Shim

arXiv:2511.12405 · cs.CV · November 18, 2025

TL;DR

VLA-R is an open-world end-to-end autonomous driving framework that combines a frozen vision-language model, contrastive learning, and vision-action retrieval to improve generalization and reasoning in unstructured environments not seen during training.

Contribution

The paper proposes an open-world end-to-end autonomous driving (OW-E2EAD) approach that integrates open-world perception from a frozen vision-language model with vision-action retrieval and contrastive learning for better generalization.

Findings

01 Strong generalization in unseen environments

02 Effective open-world reasoning and action retrieval

03 Limited data still yields good performance

Abstract

Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we…
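The abstract names a vision-action retrieval paradigm but is cut off before describing how it works. As a rough illustration of what embedding-based retrieval generally looks like, here is a minimal sketch that picks the action whose embedding is closest to a vision embedding under cosine similarity. It is not the authors' method: the function names, the `ACTION_BANK` entries, and the three-dimensional toy vectors are all hypothetical, and the choice of cosine similarity is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_action(vision_embedding, action_bank):
    """Return (name, score) for the action embedding most similar to the vision embedding."""
    if not action_bank:
        raise ValueError("action_bank must not be empty")
    scored = [
        (name, cosine_similarity(vision_embedding, embedding))
        for name, embedding in action_bank.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy action bank; real embeddings would come from learned encoders.
ACTION_BANK = {
    "go_straight": [1.0, 0.0, 0.0],
    "turn_left": [0.0, 1.0, 0.0],
    "turn_right": [0.0, 0.0, 1.0],
}

# Hypothetical vision embedding that points mostly along the "turn_left" axis.
best_action, best_score = retrieve_action([0.1, 0.9, 0.2], ACTION_BANK)
print(best_action, round(best_score, 3))
```

In this toy setup the query is matched to `turn_left` because that entry has the highest cosine similarity. Framing action selection as a nearest-neighbor lookup in a shared embedding space is one common way retrieval systems are built; whether VLA-R scores candidates this way is not stated in the text available here.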

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis