Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang

TL;DR
This paper introduces a training approach that improves gaze reasoning in vision foundation models, substantially improving gaze following performance, especially when the gaze target is not salient.
Contribution
It proposes a head-conditioned local LoRA and an out-of-cone penalty to strengthen gaze reasoning in vision foundation models (VFMs) for gaze following.
Findings
Achieves state-of-the-art results on GazeFollow and VAT datasets.
Significant improvements when gaze targets are non-salient.
Provides insights for future gaze following research.
Abstract
Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an…
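To make the idea of localized adaptation concrete, the sketch below shows one possible reading of a "local" LoRA: a frozen linear layer whose low-rank update is applied only to tokens flagged as head tokens, so that scene tokens pass through the pretrained weights unchanged. This is an illustration under stated assumptions, not the authors' implementation. The function name `local_lora_linear`, the boolean `head_mask`, and all shapes are hypothetical, and how head tokens are actually selected or conditioned is not specified in the abstract.

```python
import numpy as np

def local_lora_linear(x, W, A, B, head_mask, scale=1.0):
    """Frozen linear layer plus a low-rank update applied only to head tokens.

    x:         (num_tokens, d_in) token features
    W:         (d_in, d_out) frozen pretrained weight
    A, B:      (d_in, r) and (r, d_out) trainable low-rank factors
    head_mask: (num_tokens,) bool, True for tokens treated as head tokens
               (assumption: the mask is given, e.g. derived from a head box)
    """
    base = x @ W                    # pretrained path, shared by all tokens
    delta = (x @ A) @ B * scale     # standard LoRA low-rank update
    # Gate the update so it only reaches head tokens; scene tokens keep `base`.
    return base + head_mask[:, None] * delta

rng = np.random.default_rng(0)
num_tokens, d_in, d_out, r = 6, 8, 8, 2
x = rng.normal(size=(num_tokens, d_in))
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
# LoRA usually initializes B to zero; random values are used here only so
# that the effect of the update is visible in this toy example.
B = rng.normal(size=(r, d_out))
head_mask = np.array([False, True, True, False, False, False])

out = local_lora_linear(x, W, A, B, head_mask)
frozen = x @ W
scene_unchanged = np.allclose(out[~head_mask], frozen[~head_mask])
head_changed = not np.allclose(out[head_mask], frozen[head_mask])
print(scene_unchanged, head_changed)
```

In this toy setup the scene-token outputs match the frozen layer exactly while the head-token outputs differ, which is the property the abstract describes as preserving scene token learning while improving head token learning.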
