GazeMoE: Perception of Gaze Target with Mixture-of-Experts
Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li

TL;DR
GazeMoE introduces a novel mixture-of-experts framework that enhances gaze target estimation by adaptively integrating multi-modal cues from foundation models, achieving state-of-the-art results in challenging scenarios.
Contribution
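The paper presents GazeMoE, an end-to-end model that uses MoE modules to select gaze-target-related cues (eyes, head pose, gestures, context) from a frozen foundation model. A class-balancing auxiliary loss addresses the imbalance between in-frame and out-of-frame gaze targets and is intended to improve robustness.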
Findings
Achieves state-of-the-art performance on benchmark datasets.
Effectively handles class imbalance with auxiliary loss.
Demonstrates robustness through strategic data augmentations.
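As a rough illustration of how an in-frame vs. out-of-frame imbalance can be countered, the sketch below weights a binary cross-entropy by inverse class frequency. This page does not give the paper's actual auxiliary loss, so the function name `balanced_bce`, the weighting scheme, and the convention that label 1 means "in-frame" are all assumptions made for illustration.

```python
import numpy as np

def balanced_bce(probs, labels, eps=1e-7):
    """Inverse-frequency weighted binary cross-entropy (illustrative only).

    probs  : predicted probability that the gaze target is in-frame
    labels : 1 for in-frame, 0 for out-of-frame (assumed convention)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels, dtype=float)
    n = labels.size
    n_pos = labels.sum()
    n_neg = n - n_pos
    # Each class contributes roughly half of the total weight,
    # regardless of how many samples it has in the batch.
    w_pos = n / (2 * max(n_pos, 1))
    w_neg = n / (2 * max(n_neg, 1))
    per_sample = -(w_pos * labels * np.log(probs)
                   + w_neg * (1 - labels) * np.log(1 - probs))
    return per_sample.mean()

def plain_bce(probs, labels, eps=1e-7):
    """Unweighted binary cross-entropy, for comparison."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    labels = np.asarray(labels, dtype=float)
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

# A batch dominated by in-frame samples; the model predicts "in-frame" everywhere.
labels = [1, 1, 1, 0]
probs = [0.9, 0.9, 0.9, 0.9]
loss_balanced = balanced_bce(probs, labels)
loss_plain = plain_bce(probs, labels)
```

With these inputs the single out-of-frame sample is weighted more heavily, so the balanced loss penalizes the always-in-frame prediction more than the unweighted loss does.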
Abstract
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss…
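The routing idea in the abstract can be pictured with a generic top-k mixture-of-experts layer. The sketch below is not GazeMoE's architecture: the gate, the single-matrix experts, the `top_k=2` choice, and the random array standing in for frozen foundation-model features are all hypothetical, chosen only to show how a gate selects and mixes a few experts per token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Generic top-k MoE layer (illustrative, not the paper's design).

    tokens    : (n, d) feature tokens, e.g. from a frozen backbone
    gate_w    : (d, E) gating weights
    expert_ws : (E, d, d) one weight matrix per expert
    """
    logits = tokens @ gate_w                                  # (n, E)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]         # (n, k) chosen experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)                    # (n, k), rows sum to 1
    out = np.zeros_like(tokens)
    for slot in range(top_k):
        for e in range(expert_ws.shape[0]):
            mask = top_idx[:, slot] == e
            if mask.any():
                # Only tokens routed to expert e are processed by it.
                expert_out = np.tanh(tokens[mask] @ expert_ws[e])
                out[mask] += weights[mask, slot:slot + 1] * expert_out
    return out, top_idx, weights

rng = np.random.default_rng(0)
n_tokens, dim, n_experts = 6, 8, 4
tokens = rng.normal(size=(n_tokens, dim))      # stand-in for frozen backbone features
gate_w = rng.normal(size=(dim, n_experts))
expert_ws = rng.normal(size=(n_experts, dim, dim)) * 0.1
mixed, chosen, weights = moe_layer(tokens, gate_w, expert_ws, top_k=2)
```

Because each token activates only `top_k` of the experts, different tokens can be handled by different experts while the per-token compute stays fixed. This is the general motivation for MoE decoding over a shared feature extractor.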
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Multimodal Machine Learning Applications
