GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding

Trong Thang Pham; Hien Nguyen; Ngan Le

arXiv:2603.25841 · cs.CV · March 30, 2026

TL;DR

GazeQwen is a lightweight, gaze-aware multimodal language model that effectively utilizes eye-gaze data for streaming video understanding, outperforming larger models with minimal additional parameters.

Contribution

Introduces GazeQwen, a parameter-efficient method for integrating gaze information into LLMs via hidden-state modulation and a compact gaze resampler.

Findings

01. GazeQwen reaches 63.9% accuracy across all 10 tasks of the StreamGaze benchmark.

02. It outperforms larger models and GPT-4o (by 10.5 points) on StreamGaze.

03. Learning where to inject gaze is more effective than increasing model size.

Abstract

Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5 M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the…
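The injection mechanism the abstract describes (additive residuals applied to selected decoder layers via forward hooks) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: `ToyDecoderLayer`, `ToyGazeResampler`, `attach_gaze_hooks`, the mean-pooling fusion, the zero initialization, and all dimensions are hypothetical stand-ins chosen for illustration. Only `register_forward_hook` is a real PyTorch API.

```python
import torch
import torch.nn as nn


class ToyDecoderLayer(nn.Module):
    """Hypothetical stand-in for one LLM decoder layer (not Qwen2.5-VL)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + torch.tanh(self.proj(hidden))


class ToyGazeResampler(nn.Module):
    """Hypothetical resampler: fuses video features with gaze positional
    encodings and emits one residual per hooked layer."""

    def __init__(self, feat_dim: int, dim: int, num_layers: int):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feat_dim, dim) for _ in range(num_layers)])
        # Assumption: zero init so injection starts as a no-op.
        for head in self.heads:
            nn.init.zeros_(head.weight)
            nn.init.zeros_(head.bias)

    def forward(self, video_feats: torch.Tensor, gaze_pos_enc: torch.Tensor):
        # (B, T, F) -> (B, 1, F); mean pooling is an illustrative choice.
        fused = (video_feats + gaze_pos_enc).mean(dim=1, keepdim=True)
        return [head(fused) for head in self.heads]


def attach_gaze_hooks(layers, layer_ids, store):
    """Register hooks that add store["residuals"][slot] to each layer output."""
    handles = []
    for slot, layer_id in enumerate(layer_ids):
        def hook(module, args, output, slot=slot):
            # Returning a value from a forward hook replaces the layer output.
            # Real decoder layers may return tuples; this toy returns a tensor.
            return output + store["residuals"][slot]
        handles.append(layers[layer_id].register_forward_hook(hook))
    return handles


def run(layers, hidden):
    for layer in layers:
        hidden = layer(hidden)
    return hidden


torch.manual_seed(0)
dim, feat_dim = 16, 8
layers = nn.ModuleList([ToyDecoderLayer(dim) for _ in range(4)])
layer_ids = [1, 3]  # hypothetical choice of injected layers
resampler = ToyGazeResampler(feat_dim, dim, len(layer_ids))
store = {}

hidden = torch.randn(2, 5, dim)          # (batch, tokens, dim)
video_feats = torch.randn(2, 6, feat_dim)
gaze_pos_enc = torch.randn(2, 6, feat_dim)

with torch.no_grad():
    baseline = run(layers, hidden)

    handles = attach_gaze_hooks(layers, layer_ids, store)
    store["residuals"] = resampler(video_feats, gaze_pos_enc)
    zero_init_out = run(layers, hidden)   # residuals are zero at init

    for head in resampler.heads:          # simulate a trained resampler
        head.bias.fill_(0.1)
    store["residuals"] = resampler(video_feats, gaze_pos_enc)
    modulated = run(layers, hidden)

    for handle in handles:                # removing hooks restores the backbone
        handle.remove()
    restored = run(layers, hidden)
```

In this sketch the backbone weights are never touched: the hooks sit outside the layers, so detaching them returns the original model, which is one plausible reason a hook-based design stays lightweight.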

Code & Models

Repositories

phamtrongthang123/gazeqwen (GitHub)