Pedestrian Intention Prediction via Vision-Language Foundation Models

Mohsen Azarmi; Mahdi Rezaei; He Wang

arXiv:2507.04141·cs.CV·July 8, 2025

TL;DR

This paper demonstrates that vision-language foundation models, when guided by hierarchical prompts and contextual information, significantly improve pedestrian crossing intention prediction accuracy in autonomous driving scenarios.

Contribution

The paper introduces an approach that applies VLFMs with hierarchical prompts and automatic prompt engineering to pedestrian crossing intention prediction.

Findings

01

Incorporating vehicle speed, its variations over time, and time-conscious prompts improves prediction accuracy by up to 19.8%.

02

Automatic prompt engineering yields an additional 12.5% accuracy gain.

03

VLFMs outperform conventional vision-based models in generalization and context understanding.

Abstract

Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cue observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances the prediction accuracy by up to 19.8%.…
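
To make the idea of a hierarchical prompt template concrete, here is a minimal Python sketch of how ego-vehicle speed, its change over time, timing information, and physical cue observations might be layered into one text prompt. It is not the authors' template: the `Observation` fields, the prompt wording, the 1 km/h trend threshold, and the answer labels are all assumptions made for illustration, and the call to an actual VLFM is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical per-clip context; field names are illustrative only."""
    speeds_kmh: List[float]   # ego-vehicle speed at each sampled frame
    physical_cues: List[str]  # e.g. "looking at vehicle"
    frame_interval_s: float   # time between sampled frames


def describe_speed(speeds_kmh: List[float]) -> str:
    """Summarize ego speed and its change over the clip in words."""
    if not speeds_kmh:
        return "Ego-vehicle speed: unknown."
    delta = speeds_kmh[-1] - speeds_kmh[0]
    # The 1 km/h threshold is an arbitrary assumption, not a value from the paper.
    if delta > 1.0:
        trend = "accelerating"
    elif delta < -1.0:
        trend = "decelerating"
    else:
        trend = "holding a steady speed"
    return (f"Ego-vehicle speed: {speeds_kmh[-1]:.1f} km/h, "
            f"{trend} (change {delta:+.1f} km/h).")


def build_prompt(obs: Observation) -> str:
    """Assemble the prompt from general task framing down to the final question."""
    duration_s = obs.frame_interval_s * max(len(obs.speeds_kmh) - 1, 0)
    cues = "; ".join(obs.physical_cues) if obs.physical_cues else "none observed"
    levels = [
        "Task: decide whether the highlighted pedestrian intends to cross the road.",
        f"Timing: the attached frames span {duration_s:.1f} s in chronological order.",
        describe_speed(obs.speeds_kmh),
        f"Pedestrian cues: {cues}.",
        "Answer with exactly one word: crossing or not-crossing.",
    ]
    return "\n".join(levels)


if __name__ == "__main__":
    example = Observation(
        speeds_kmh=[30.0, 25.0, 20.0],
        physical_cues=["looking at vehicle", "standing at curb"],
        frame_interval_s=0.5,
    )
    print(build_prompt(example))
```

The resulting string would be sent to the model together with the video frames. Keeping each level as a separate line makes it straightforward to ablate one source of context at a time, for example dropping the speed line to measure its contribution.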
