Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty
Sharath Koorathota, Nikolas Papadopoulos, Jia Li Ma, Shruti Kumar, Xiaoxiao Sun, Arunesh Mittal, Patrick Adelman, Paul Sajda

TL;DR
This paper enhances Vision Transformers for driving decision prediction by integrating human eye gaze data, introducing a novel loss function, and demonstrating improved accuracy and attention alignment under uncertain conditions.
Contribution
We introduce the fixation-attention intersection (FAX) loss, which incorporates human eye gaze into ViT and improves the model's focus and accuracy in driving scenarios under uncertainty. The approach is presented as a novel contribution to human-centered AI.
Findings
Gaze data improves ViT attention alignment with human focus.
FAX loss significantly boosts prediction accuracy under uncertainty.
Gaze-informed ViT outperforms baseline models in driving decision tasks.
Abstract
Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored. This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy for driving under uncertainty in both real-world and virtual reality settings. First, we establish the significance of human eye gaze in left-right driving decisions, as observed in both human subjects and a ViT model. By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap demonstrates that fixation data can guide the model in distributing its attention weights more effectively. We introduce the fixation-attention intersection (FAX) loss, a novel loss function that significantly improves ViT performance under high uncertainty…
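The abstract does not define how fixation maps are compared with attention weights, nor the exact form of the FAX loss. The sketch below is therefore only an illustration of the general idea, under assumptions that are not taken from the paper: overlap is measured as a histogram intersection of two maps normalized to sum to 1, and a hypothetical training objective adds a penalty of `weight * (1 - overlap)` to a task loss. The function names, the `(layers, heads, H, W)` attention layout, and the `weight` parameter are all invented for this example.

```python
import numpy as np


def normalize_map(m, eps=1e-12):
    """Scale a non-negative 2D map so that its entries sum to 1."""
    m = np.asarray(m, dtype=np.float64)
    total = m.sum()
    if total <= eps:
        # An empty map carries no information; fall back to a uniform map.
        return np.full(m.shape, 1.0 / m.size)
    return m / total


def overlap(fixation, attention):
    """Histogram intersection of two maps, a value in [0, 1].

    Assumption: this is one common overlap measure, not necessarily
    the similarity measure used in the paper.
    """
    f = normalize_map(fixation)
    a = normalize_map(attention)
    return float(np.minimum(f, a).sum())


def overlap_per_head(fixation, attn):
    """Overlap for every layer and head.

    attn is assumed to have shape (layers, heads, H, W), with each
    head's attention already reshaped to the fixation map's grid.
    Returns an array of shape (layers, heads).
    """
    attn = np.asarray(attn, dtype=np.float64)
    n_layers, n_heads = attn.shape[:2]
    out = np.empty((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            out[layer, head] = overlap(fixation, attn[layer, head])
    return out


def combined_loss(task_loss, fixation, attention, weight=0.1):
    """Hypothetical objective: task loss plus a penalty for low overlap."""
    return task_loss + weight * (1.0 - overlap(fixation, attention))
```

With this measure, two identical maps give an overlap of approximately 1 and two maps with no shared non-zero cells give 0, so the hypothetical penalty term vanishes only when attention matches the fixation distribution.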
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Older Adults Driving Studies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer
