TL;DR
GazeQwen is a lightweight, gaze-aware multimodal language model that effectively utilizes eye-gaze data for streaming video understanding, outperforming larger models with minimal additional parameters.
Contribution
Introduces GazeQwen, a parameter-efficient method for integrating gaze information into LLMs via hidden-state modulation and a compact gaze resampler.
Findings
GazeQwen achieves 63.9% accuracy across all 10 tasks of the StreamGaze benchmark.
It outperforms the same Qwen2.5-VL-7B backbone given gaze as visual prompts by 16.1 points, and GPT-4o by 10.5 points.
Learning where to inject gaze is more effective than increasing model size.
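The abstract states GazeQwen's score and its point gains, but not the comparison scores themselves. A quick arithmetic check, assuming the gains are plain differences in percentage points, recovers the implied comparison accuracies. These are derived values, not figures quoted from the paper.

```python
# Figures taken from the abstract (accuracy in percent, gains in points).
gazeqwen_acc = 63.9
gain_over_backbone = 16.1   # vs. Qwen2.5-VL-7B with gaze as visual prompts
gain_over_gpt4o = 10.5      # vs. GPT-4o

# Implied comparison scores, assuming each gain is a simple difference.
implied_backbone_acc = round(gazeqwen_acc - gain_over_backbone, 1)
implied_gpt4o_acc = round(gazeqwen_acc - gain_over_gpt4o, 1)

print(f"Implied Qwen2.5-VL-7B (visual prompts): {implied_backbone_acc}%")
print(f"Implied GPT-4o: {implied_gpt4o_acc}%")
```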
Abstract
Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the…
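The injection mechanism described in the abstract (additive residuals added to selected decoder layers through forward hooks) can be illustrated with a toy PyTorch sketch. Everything below is an assumption made for illustration: `ToyDecoderLayer`, `ToyGazeResampler`, `add_gaze_hooks`, the dimensions, and the chosen layer indices are hypothetical stand-ins, and random tensors take the place of V-JEPA features and fixation-derived positional encodings. It is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyDecoderLayer(nn.Module):
    """Hypothetical stand-in for one frozen LLM decoder layer."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden):
        return hidden + torch.tanh(self.proj(hidden))

class ToyGazeResampler(nn.Module):
    """Hypothetical resampler: maps gaze-conditioned features to a residual."""
    def __init__(self, feat_dim, dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, dim)

    def forward(self, feats):
        return self.proj(feats)

def add_gaze_hooks(layers, layer_ids, residual):
    """Register forward hooks that add `residual` to each chosen layer's output."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + residual
    return [layers[i].register_forward_hook(hook) for i in layer_ids]

torch.manual_seed(0)
dim, feat_dim, seq_len = 16, 8, 5
layers = nn.ModuleList([ToyDecoderLayer(dim) for _ in range(4)])
for p in layers.parameters():
    p.requires_grad_(False)          # the backbone stays frozen
resampler = ToyGazeResampler(feat_dim, dim)

def run(hidden):
    for layer in layers:
        hidden = layer(hidden)
    return hidden

hidden_in = torch.randn(1, seq_len, dim)
gaze_feats = torch.randn(1, seq_len, feat_dim)  # placeholder for real features

baseline = run(hidden_in)
handles = add_gaze_hooks(layers, [1, 3], resampler(gaze_feats))
modulated = run(hidden_in)
for h in handles:
    h.remove()                       # removing the hooks restores the backbone
restored = run(hidden_in)

trainable = sum(p.numel() for p in resampler.parameters())
print("trainable resampler parameters:", trainable)
```

Because the residual enters only through hooks, the backbone weights are never edited, and removing the hooks recovers the original model exactly. Decoder layers in real model implementations may return a tuple rather than a single tensor, in which case a hook would need to modify only the hidden-state element.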
