Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
Sima Ashayer, Hoang H. Nguyen, Yu Liang, Mina Sartipi

TL;DR
This paper introduces a lightweight, socially aware transformer-based model for pedestrian crossing intent prediction that effectively fuses behavioral features and quantifies uncertainty, outperforming existing models on benchmark datasets.
Contribution
The paper presents a novel, efficient transformer architecture that combines multiple behavioral streams and uncertainty estimation for improved pedestrian intent prediction.
Findings
Achieves 0.9 F1 and 0.94 AUC-ROC on the PSI 1.0 benchmark.
Establishes a strong baseline of 0.78 F1 on the PSI 2.0 dataset.
Selective prediction improves accuracy by 0.4% at 80% coverage.
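The selective-prediction finding can be illustrated with a minimal sketch: rank samples by a risk score, keep only the lowest-risk fraction (the coverage), and measure accuracy on what remains. The function name `selective_accuracy`, the toy labels, and the toy risk scores are all hypothetical; the summary does not say which score the authors use to rank samples or how they break ties.

```python
def selective_accuracy(y_true, y_pred, risk, coverage=0.8):
    """Accuracy on the `coverage` fraction of samples with the lowest risk.

    Assumption: a higher `risk` value means the model is less trustworthy
    on that sample, so high-risk samples are the first to be rejected.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n = len(y_true)
    if n == 0 or len(y_pred) != n or len(risk) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    # Number of samples the model is allowed to answer.
    keep = max(1, int(round(coverage * n)))

    # Sort indices from lowest to highest risk and keep the first `keep`.
    kept = sorted(range(n), key=lambda i: risk[i])[:keep]

    correct = sum(1 for i in kept if y_true[i] == y_pred[i])
    return correct / keep


# Toy data (illustrative only, not from the paper).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
risk = [0.1, 0.2, 0.9, 0.3, 0.4]

full = selective_accuracy(y_true, y_pred, risk, coverage=1.0)
selective = selective_accuracy(y_true, y_pred, risk, coverage=0.8)
```

In this toy case the single misclassified sample also has the highest risk, so rejecting 20% of the inputs removes it. On real data the gain depends on how well the risk score correlates with errors.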
Abstract
Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision-language models, achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC while using only structured, interpretable features. On…
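The two uncertainty signals named in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the bottleneck posterior is assumed to be a diagonal Gaussian compared against a standard normal prior, and the Mahalanobis detector is assumed to use a diagonal covariance so that the example needs no linear-algebra library (a full-covariance version would invert the covariance matrix instead). All function names here are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one sample.

    Closed form: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def fit_diagonal_gaussian(features, eps=1e-8):
    """Per-dimension mean and variance of training features.

    Assumption: diagonal covariance; `eps` guards against zero variance.
    """
    n = len(features)
    dim = len(features[0])
    mean = [sum(row[d] for row in features) / n for d in range(dim)]
    var = [
        sum((row[d] - mean[d]) ** 2 for row in features) / n + eps
        for d in range(dim)
    ]
    return mean, var


def mahalanobis_diagonal(x, mean, var):
    """Mahalanobis distance of `x` from the fitted (diagonal) Gaussian."""
    return math.sqrt(
        sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    )


# Toy data (illustrative only, not from the paper).
train_features = [[0.0, 0.0], [2.0, 4.0]]
mean, var = fit_diagonal_gaussian(train_features)

in_dist = mahalanobis_diagonal([1.0, 2.0], mean, var)    # at the mean
shifted = mahalanobis_diagonal([10.0, 20.0], mean, var)  # far from training data

kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior
kl_moved = kl_to_standard_normal([1.0], [0.0])            # unit shift in the mean
```

A larger KL value means the posterior has moved further from the prior for that input, while a larger Mahalanobis distance means the input's features lie further from the training distribution. How the paper combines the two into a risk score is not stated in this excerpt.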
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Multimodal Machine Learning Applications
