IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

TL;DR
IRIS is a training-free method that uses real-time eye-tracking data to resolve ambiguity in open-ended visual question answering with large vision-language models, raising accuracy on ambiguous questions from 35.2% to 77.2%.
Contribution
The paper introduces IRIS, a training-free approach that uses eye-tracking data at inference time to resolve ambiguity in VQA, and releases a new benchmark dataset, a real-time interactive protocol, and an evaluation suite.
Findings
Fixations closest to the moment participants begin verbally asking their question are the most informative for disambiguation.
IRIS more than doubles accuracy on ambiguous questions (from 35.2% to 77.2%).
Improvements are consistent across state-of-the-art VLMs when gaze data is incorporated for ambiguous image-question pairs, regardless of architecture.
Abstract
We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset for using eye-movement data to disambiguate VQA, a novel real-time interactive protocol, and an evaluation suite.
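The selection step described in the abstract, keeping the fixations closest in time to the onset of the spoken question, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Fixation` record, the `fixations_near_onset` and `gaze_hint` helpers, the choice of `k`, and the idea of passing gaze to the model as a text hint are all hypothetical, since the summary does not say how IRIS encodes gaze for the VLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fixation:
    """Hypothetical fixation record; field names and units are assumptions."""
    x: float           # horizontal gaze position in image pixels
    y: float           # vertical gaze position in image pixels
    start_s: float     # fixation onset, seconds from trial start
    duration_s: float  # fixation duration in seconds


def fixations_near_onset(
    fixations: List[Fixation], speech_onset_s: float, k: int = 3
) -> List[Fixation]:
    """Return the k fixations whose onset is closest in time to the
    moment the participant starts speaking, in chronological order."""
    ranked = sorted(fixations, key=lambda f: abs(f.start_s - speech_onset_s))
    return sorted(ranked[:k], key=lambda f: f.start_s)


def gaze_hint(fixations: List[Fixation]) -> str:
    """Format the selected fixations as a text hint that could be appended
    to the question (an assumed way of passing gaze to a VLM)."""
    points = ", ".join(f"({f.x:.0f}, {f.y:.0f})" for f in fixations)
    return f"The user was looking at these image locations (pixels): {points}."


if __name__ == "__main__":
    # Made-up trace for illustration only; not data from the paper.
    trace = [
        Fixation(40.0, 60.0, 0.2, 0.25),
        Fixation(300.0, 210.0, 1.0, 0.30),
        Fixation(120.0, 340.0, 1.9, 0.20),
        Fixation(128.0, 335.0, 2.1, 0.35),
        Fixation(500.0, 90.0, 3.5, 0.15),
    ]
    selected = fixations_near_onset(trace, speech_onset_s=2.0, k=2)
    print(gaze_hint(selected))
```

Ranking by absolute time difference treats fixations just before and just after speech onset symmetrically. Whether a one-sided window works better is something the released benchmark would have to settle.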
