Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma

TL;DR
Voila-A aligns vision-language models with human gaze attention, using gaze data to improve interpretability and performance in real-world scenarios.
Contribution
The paper presents Voila-A, an approach that incorporates gaze information into VLMs. It contributes a dataset and modules for gaze alignment, aimed at improving model interpretability and effectiveness.
Findings
Voila-A significantly outperforms baseline models.
Gaze alignment improves model interpretability.
Gaze data can be effectively mimicked using localized narratives.
Abstract
In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO…
Taxonomy
Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection
Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Dropout · Adam · Linear Layer · Byte Pair Encoding
