VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness
Qimao Chen, Fang Li, Shaoqing Xu, Zhiyi Lai, Zixun Xie, Yuechen Luo, Shengyin Jiang, Hanbing Li, Long Chen, Bing Wang, Yi Zhang, Zhi-Xin Yang

TL;DR
VILTA places a Vision Language Model (VLM) directly in the training loop of an autonomous driving policy, where it generates diverse, challenging scenarios intended to improve safety and robustness in rare, critical situations.
Contribution
This work presents VILTA, an approach that uses a VLM for direct, fine-grained scenario editing within closed-loop training, aiming for more diverse and challenging scenarios than two-stage pipelines whose diversity is capped by a separate downstream trajectory generator.
Findings
Enhanced safety and robustness of driving policies.
Improved handling of critical long-tail events.
Generation of diverse, plausible challenging scenarios.
Abstract
The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions, including safety-critical scenario generation and closed-loop learning, often rely on rule-based heuristics, resampling methods, and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop…
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
