GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer
Xinyuan Zhao, Xianrui Chen, Ahmad Chaddad

TL;DR
GazeFormer-MoE introduces a semantics-aware, multi-scale Transformer for 3D gaze estimation that leverages CLIP features, prototype conditioning, and Mixture of Experts to achieve state-of-the-art accuracy across multiple datasets.
Contribution
The paper proposes a novel, context-aware Transformer architecture with prototype conditioning and MoE, significantly improving 3D gaze estimation accuracy.
Findings
Achieves new state-of-the-art angular errors on MPIIFaceGaze (2.49°), EYEDIAP (3.22°), Gaze360 (10.16°), and ETH-XGaze (1.44°).
Demonstrates up to 64% relative improvement over previous methods.
Shows ablation studies confirming the effectiveness of prototype conditioning, cross-scale fusion, and MoE.
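The two metrics behind these findings can be sketched as follows. Angular error is the angle between the predicted and ground-truth 3D gaze vectors, and relative improvement is the reduction in error as a fraction of the earlier error. This is a minimal, standard-library sketch, not the authors' evaluation code: the function names are hypothetical, and the numbers in the usage note are illustrative rather than taken from the paper.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of range.
    cos_sim = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_sim))


def relative_improvement(previous_error, new_error):
    """Fractional reduction in error relative to the previous error."""
    return (previous_error - new_error) / previous_error
```

For example, two orthogonal gaze vectors such as (1, 0, 0) and (0, 1, 0) give an angular error of 90 degrees, and an illustrative drop from 10.0 to 5.0 degrees is a relative improvement of 0.5, that is, 50%.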
Abstract
We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, respectively, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute the gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter choices. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
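To make the "routed/shared Mixture of Experts" idea concrete, here is a minimal sketch of one common formulation: a shared expert processes every token, while a router selects the top-k routed experts and mixes their outputs with renormalized gate weights. The abstract does not give the number of experts, the value of k, the activation, or the gating details, so all of those (and the names `make_expert`, `moe_ffn`, `router_w`, `top_k`) are assumptions for illustration, not the authors' implementation. Plain Python lists stand in for tensors.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_expert(dim, hidden, rng):
    """Build a small two-layer ReLU FFN (hypothetical expert shape)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, v) for v in linear(w1, x)]
        return linear(w2, h)

    return expert


def moe_ffn(x, router_w, routed_experts, shared_expert, top_k=2):
    """Shared expert output plus a gate-weighted sum of the top-k routed experts."""
    gate = softmax(linear(router_w, x))  # one probability per routed expert
    top = sorted(range(len(gate)), key=lambda i: gate[i], reverse=True)[:top_k]
    norm = sum(gate[i] for i in top)  # renormalize over the selected experts
    out = shared_expert(x)  # the shared expert always runs
    for i in top:
        y = routed_experts[i](x)
        weight = gate[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, top
```

A call such as `moe_ffn(token, router_w, experts, shared, top_k=2)` returns the mixed output vector and the indices of the experts that were selected. Only the selected experts are evaluated per token, which is how this kind of layer adds conditional capacity without running every expert on every input.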
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Hand Gesture Recognition Systems
