Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network
Yuanzhe Li, Steffen Müller

TL;DR
This paper introduces a multimodal fusion network utilizing visual and motion data with Transformer modules and attention mechanisms to accurately predict pedestrian crossing intentions, enhancing autonomous vehicle safety.
Contribution
The paper presents a novel multimodal fusion network with depth-guided, modality, and temporal attention modules for improved pedestrian intention prediction.
Findings
Achieves superior performance on the JAAD dataset
Effectively integrates multiple modalities for prediction
Outperforms baseline methods in accuracy
Abstract
Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. A depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature…
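To make the idea of attention-weighted fusion across modalities concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the abstract shown here is truncated and does not specify how the modality attention module computes its weights. The sketch assumes generic scaled dot-product scoring against a single query vector, and the names `modality_attention_fusion`, `features`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_fusion(features, query):
    """Fuse per-modality feature vectors into one vector.

    features: list of equal-length vectors, one per modality
              (assumed to be already extracted, e.g. by an encoder).
    query:    hypothetical learned vector of the same length, used
              only to score how relevant each modality is.
    Returns (fused_vector, attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score for each modality (an assumption,
    # not taken from the paper).
    scores = [
        sum(f_d * q_d for f_d, q_d in zip(feat, query)) / math.sqrt(dim)
        for feat in features
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, dimension by dimension.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Three toy "modalities" with 2-dimensional features.
    toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, weights = modality_attention_fusion(toy_features, [0.5, 0.5])
    print("weights:", weights)
    print("fused:", fused)
```

In this toy setup the weights sum to one, so the fused vector is a convex combination of the modality features. A modality whose features align more strongly with the query contributes more to the result.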
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Multimodal Machine Learning Applications
