Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

Shijing Wang; Yaping Huang; Chaoqun Cui; David Wong; Yihua Cheng; Alexandros Neophytou; Hyung Jin Chang

arXiv:2605.22607·cs.CV·May 22, 2026

TL;DR

This paper introduces a training mechanism that strengthens gaze reasoning in vision foundation models, significantly improving gaze-following performance, especially when the gaze target is not salient.

Contribution

The paper proposes a head-conditioned local LoRA and an out-of-cone penalty to strengthen gaze reasoning in vision foundation models (VFMs) for gaze following.

Findings

01. Achieves state-of-the-art results on the GazeFollow and VAT datasets.

02. Improves performance significantly when gaze targets are non-salient.

03. Provides insights for future gaze-following research.

Abstract

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an…
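
The abstract describes the head-conditioned local LoRA only at a high level: the adaptation is localized so that head tokens are updated while scene token learning is preserved. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes a standard low-rank update B(Ax) added to a frozen linear layer, applied only to tokens flagged by a hypothetical `head_mask`; the function names, shapes, and the way head tokens are selected are all assumptions made for illustration.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def local_lora_linear(tokens, head_mask, W, A, B, alpha=1.0):
    """Frozen linear layer with a low-rank update added only on head tokens.

    tokens:    list of N input vectors, each of length d_in
    head_mask: list of N booleans, True where a token covers the head region
               (ASSUMPTION: the abstract does not say how head tokens are chosen)
    W:         frozen weight, d_out x d_in
    A, B:      LoRA factors of shape r x d_in and d_out x r
    """
    rank = len(A)
    scale = alpha / rank
    out = []
    for x, is_head in zip(tokens, head_mask):
        y = matvec(W, x)  # frozen path, shared by every token
        if is_head:
            delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
            y = [yi + scale * di for yi, di in zip(y, delta)]
        out.append(y)
    return out

# Toy setup. B is random here so the update is visible; LoRA usually
# initializes B to zero so that training starts from the frozen model.
random.seed(0)
d_in, d_out, r, n = 4, 3, 2, 5
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
tokens = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
head_mask = [False, True, True, False, False]  # hypothetical head tokens

adapted = local_lora_linear(tokens, head_mask, W, A, B)
frozen = [matvec(W, x) for x in tokens]
```

In this toy setup, scene tokens (mask `False`) pass through the frozen weights unchanged, so `adapted[i]` equals `frozen[i]` for them, while head tokens receive the extra low-rank term. That is the sense in which an update of this kind could leave scene representations intact while adapting the head tokens.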
