GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
Xianyu Chen, Ming Jiang, Qi Zhao

TL;DR
GazeXplain introduces a model that jointly predicts human visual scanpaths and generates natural-language explanations for the fixations, trained with a co-training approach across eye-tracking datasets.
Contribution
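The paper annotates natural-language explanations for fixations across eye-tracking datasets and proposes an attention-language decoder that jointly predicts scanpaths and generates explanations. A semantic alignment mechanism improves the consistency between fixations and explanations, and cross-dataset co-training supports generalization.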
Findings
Effective in predicting scanpaths across datasets
Generates coherent natural language explanations
Improves understanding of visual attention mechanisms
Abstract
While exploring visual scenes, humans' scanpaths are driven by their underlying attention processes. Understanding visual scanpaths is essential for various applications. Traditional scanpath models predict the where and when of gaze shifts without providing explanations, creating a gap in understanding the rationale behind fixations. To bridge this gap, we introduce GazeXplain, a novel study of visual scanpath prediction and explanation. This involves annotating natural-language explanations for fixations across eye-tracking datasets and proposing a general model with an attention-language decoder that jointly predicts scanpaths and generates explanations. It integrates a unique semantic alignment mechanism to enhance the consistency between fixations and explanations, alongside a cross-dataset co-training approach for generalization. These novelties present a comprehensive and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need
