VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
Qingwen Pu, Kun Xie, and Yuxiang Liu

TL;DR
This paper presents VLM-VPI, a multimodal reasoning framework that improves pedestrian intent understanding and safety in autonomous driving by combining visual, kinematic, and demographic cues with a vision-language model and a large language model.
Contribution
The work introduces a novel multimodal reasoning system combining vision, language, and demographic data to improve pedestrian interaction safety in autonomous vehicles.
Findings
Achieves 92.3% intent classification accuracy in CARLA scenarios.
Reduces false alarms and conflict occurrences significantly in simulations.
Demographic-adaptive control further decreases conflicts for children and seniors.
Abstract
Autonomous driving systems often infer pedestrian yielding behavior from geometric and kinematic cues alone, limiting their ability to reason about visual scene context and age-dependent behavioral variability. This limitation can produce delayed interventions in safety-critical encounters and unnecessary braking in benign interactions. This work introduces Vision-Language Model-based Vehicle-Pedestrian Interaction (VLM-VPI), a multimodal reasoning framework for pedestrian intent understanding and yielding-aware control in autonomous driving. The system combines three components: a multimodal perception layer that captures visual and kinematic observations, a reasoning layer that uses Qwen3-VL 8B for visual scene understanding and GPT-OSS 20B for few-shot intent reasoning, and a tiered safety controller that applies age-specific braking margins for children, adults, and seniors. In 112…
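The tiered, age-specific controller described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the names `AgeGroup`, `Intent`, `Observation`, and `select_action`, the use of time-to-collision as the braking criterion, the two-tier soft/emergency split, and the margin values in `TTC_MARGIN_S` are all hypothetical. The pedestrian's intent and age group are taken as given inputs, standing in for the output of the reasoning layer.

```python
from dataclasses import dataclass
from enum import Enum


class AgeGroup(Enum):
    CHILD = "child"
    ADULT = "adult"
    SENIOR = "senior"


class Intent(Enum):
    CROSSING = "crossing"
    NOT_CROSSING = "not_crossing"


# Hypothetical time-to-collision margins in seconds.
# These numbers are placeholders and are NOT taken from the paper.
TTC_MARGIN_S = {
    AgeGroup.CHILD: 4.0,
    AgeGroup.ADULT: 3.0,
    AgeGroup.SENIOR: 4.5,
}


@dataclass
class Observation:
    distance_m: float         # gap between vehicle and pedestrian
    closing_speed_mps: float  # rate at which that gap is shrinking
    age_group: AgeGroup       # assumed to come from the perception/reasoning layers
    intent: Intent            # assumed to come from the reasoning layer


def time_to_collision(obs: Observation) -> float:
    """Return distance / closing speed, or infinity if the gap is not shrinking."""
    if obs.closing_speed_mps <= 0:
        return float("inf")
    return obs.distance_m / obs.closing_speed_mps


def select_action(obs: Observation) -> str:
    """Pick a control tier from the predicted intent and an age-specific margin."""
    if obs.intent is Intent.NOT_CROSSING:
        return "cruise"
    ttc = time_to_collision(obs)
    margin = TTC_MARGIN_S[obs.age_group]
    if ttc < 0.5 * margin:  # assumed emergency tier: under half the margin
        return "emergency_brake"
    if ttc < margin:        # assumed soft tier: inside the margin
        return "soft_brake"
    return "cruise"
```

Under these placeholder margins, the same encounter (20 m gap closing at 6 m/s) triggers soft braking for a child or senior but not for an adult, which is the kind of age-dependent behavior the abstract attributes to the controller.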
