Voila-A: Aligning Vision-Language Models with User's Gaze Attention

Kun Yan; Lei Ji; Zeyu Wang; Yuntao Wang; Nan Duan; Shuai Ma

arXiv:2401.09454 · cs.CV · January 19, 2024 · 1 citation

TL;DR

Voila-A aligns vision-language models with users' gaze attention, using gaze data as a proxy for human attention to improve interpretability and effectiveness in real-world scenarios.

Contribution

The paper presents Voila-A, a new approach that incorporates gaze information into VLMs, including a dataset and modules for gaze alignment, enhancing model interpretability and effectiveness.

Findings

1. Voila-A significantly outperforms baseline models.
2. Gaze alignment improves model interpretability.
3. Gaze data can be effectively mimicked using localized narratives.

Abstract

In recent years, the integration of vision and language understanding has led to significant advancements in artificial intelligence, particularly through Vision-Language Models (VLMs). However, existing VLMs face challenges in handling real-world applications with complex scenes and multiple objects, as well as aligning their focus with the diverse attention patterns of human users. In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications. First, we collect hundreds of minutes of gaze data to demonstrate that we can mimic human gaze modalities using localized narratives. We then design an automatic data annotation pipeline utilizing GPT-4 to generate the VOILA-COCO…

Videos

Voila-A: Aligning Vision-Language Models with User's Gaze Attention · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection

Methods: Sparse Evolutionary Training · Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Dropout · Adam · Linear Layer · Byte Pair Encoding