Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Ross Greer; Maitrayee Keskar; Angel Martinez-Sanchez; Parthib Roy; Shashank Shriram; Mohan Trivedi

arXiv:2602.07680·cs.CV·February 19, 2026

Open Access

TL;DR

This paper explores how vision-language models can enhance autonomous driving safety by providing semantic hazard detection, improving planning with scene understanding, and incorporating natural language instructions for safer behavior.

Contribution

It introduces a hazard screening method using CLIP, analyzes vision-language embeddings in trajectory planning, and uses natural language constraints to improve safety in autonomous driving.

Findings

01. CLIP-based hazard screening detects diverse road hazards efficiently

02. Global scene embeddings alone do not improve trajectory accuracy

03. Natural language instructions reduce planning failures and enhance safety

Abstract

Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a…
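As a rough illustration of the first use case, the sketch below shows one way an image-text similarity score could be turned into a scalar hazard signal. It is not the authors' implementation: the function names, the split into "hazard" and "safe" prompt embeddings, the softmax aggregation, and the `temperature` value are all assumptions made for illustration, and the embeddings are toy vectors standing in for the outputs of CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hazard_score(image_emb, hazard_embs, safe_embs, temperature=0.01):
    """Return the softmax probability mass assigned to the hazard prompts.

    image_emb   -- embedding of the camera frame (assumed to come from an image encoder)
    hazard_embs -- embeddings of hazard-describing prompts (hypothetical prompt set)
    safe_embs   -- embeddings of benign-scene prompts (hypothetical prompt set)
    temperature -- softmax temperature (assumed value, not taken from the paper)
    """
    prompts = list(hazard_embs) + list(safe_embs)
    logits = [cosine(image_emb, p) / temperature for p in prompts]
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(probs[: len(hazard_embs)])


# Toy 3-d vectors in place of real encoder outputs.
frame = [0.9, 0.1, 0.0]
hazard_prompts = [[1.0, 0.0, 0.0]]  # e.g. "debris on the road" (illustrative)
safe_prompts = [[0.0, 1.0, 0.0]]    # e.g. "a clear road" (illustrative)
score = hazard_score(frame, hazard_prompts, safe_prompts)
```

A score near 1 means the frame sits closer to the hazard prompts than to the benign ones. In a real pipeline the text embeddings could be computed once and cached, leaving a single image encoding plus a handful of dot products per frame, which is consistent with the low-latency framing in the abstract.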

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning