Gaze-Informed Vision Transformers: Predicting Driving Decisions Under Uncertainty

Sharath Koorathota; Nikolas Papadopoulos; Jia Li Ma; Shruti Kumar; Xiaoxiao Sun; Arunesh Mittal; Patrick Adelman; Paul Sajda

arXiv:2308.13969·cs.CV·January 13, 2025·2 cites

Open Access · 1 repository

TL;DR

This paper enhances Vision Transformers for driving decision prediction by integrating human eye gaze data, introducing a novel loss function, and demonstrating improved accuracy and attention alignment under uncertain conditions.

Contribution

We introduce the fixation-attention intersection (FAX) loss, which incorporates human eye gaze into ViT and improves the model's focus and accuracy in driving scenarios under uncertainty, a novel approach to human-centered AI.

Findings

01

Gaze data improves ViT attention alignment with human focus.

02

FAX loss significantly boosts prediction accuracy under uncertainty.

03

Gaze-informed ViT outperforms baseline models in driving decision tasks.

Abstract

Vision Transformers (ViT) have advanced computer vision, yet their efficacy in complex tasks like driving remains less explored. This study enhances ViT by integrating human eye gaze, captured via eye-tracking, to increase prediction accuracy in driving scenarios under uncertainty, in both real-world and virtual reality settings. First, we establish the significance of human eye gaze in left-right driving decisions, as observed in both human subjects and a ViT model. By comparing the similarity between human fixation maps and ViT attention weights, we reveal the dynamics of overlap across individual heads and layers. This overlap demonstrates that fixation data can guide the model in distributing its attention weights more effectively. We introduce the fixation-attention intersection (FAX) loss, a novel loss function that significantly improves ViT performance under high uncertainty…
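
The abstract describes comparing human fixation maps with ViT attention weights but, as shown here, does not give the formula for that overlap or for the FAX loss. The sketch below is only one plausible way to quantify such an overlap, assuming both maps are flattened over the same image patches and using a histogram intersection of the two normalized maps. The function names, the toy values, and the `1 - overlap` penalty are illustrative assumptions, not the paper's definition and not code from the linked repository.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def histogram_intersection(fixation, attention):
    """Overlap in [0, 1] between two maps given as flat lists over the same patches."""
    if len(fixation) != len(attention):
        raise ValueError("maps must cover the same number of patches")
    p = normalize(fixation)
    q = normalize(attention)
    # Sum of element-wise minima: 1 for identical maps, 0 for disjoint ones.
    return sum(min(a, b) for a, b in zip(p, q))


def overlap_penalty(fixation, attention):
    """Hypothetical penalty term (an assumption): small when the maps agree."""
    return 1.0 - histogram_intersection(fixation, attention)


# Toy 2x2 patch grid, flattened row by row (made-up values, not real data).
fixation_map = [0.0, 3.0, 1.0, 0.0]      # e.g. fixation counts per patch
attention_head = [0.1, 0.5, 0.2, 0.2]    # e.g. one head's attention over patches
score = histogram_intersection(fixation_map, attention_head)
penalty = overlap_penalty(fixation_map, attention_head)
```

Computing `score` separately for each head and layer would give the kind of per-head, per-layer overlap profile the abstract mentions. How the authors actually weight or combine such terms is not stated on this page.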

Code & Models

Repositories

schko/fixatt
PyTorch · Official

Taxonomy

Topics

Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Older Adults Driving Studies

Methods

Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer