GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer

Xinyuan Zhao; Xianrui Chen; Ahmad Chaddad

arXiv:2601.12316·cs.CV·January 21, 2026

TL;DR

GazeFormer-MoE introduces a semantics-aware, multi-scale Transformer for 3D gaze estimation that leverages CLIP features, prototype conditioning, and Mixture of Experts to achieve state-of-the-art accuracy across multiple datasets.

Contribution

The paper proposes a novel, context-aware Transformer architecture with prototype conditioning and MoE, significantly improving 3D gaze estimation accuracy.

Findings

01

Achieves new state-of-the-art angular errors on multiple datasets.

02

Demonstrates up to 64% relative improvement over previous methods.

03

Shows ablation studies confirming the effectiveness of prototype conditioning, cross-scale fusion, and MoE.
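The findings above are reported as angular errors and as a relative improvement. As a minimal sketch of how such numbers are usually computed (the function names here are illustrative and are not taken from the paper or its repository), the angular error is the angle between the predicted and ground-truth 3D gaze vectors, and the relative improvement is the error reduction divided by the previous error:

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Vectors need not be unit length; they are normalized here.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    cos = dot / (norm_p * norm_t)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))

def relative_improvement(prev_error, new_error):
    """Fraction by which new_error is lower than prev_error."""
    return (prev_error - new_error) / prev_error
```

Dataset-level figures such as those in the abstract are typically the mean of this per-sample angle over the test set; whether the paper averages in exactly this way is an assumption.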

Abstract

We present a semantics-modulated, multi-scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts (MoE) to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze, our model achieves new state-of-the-art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, respectively, demonstrating up to a 64% relative improvement over previously reported results. Ablations attribute the gains to prototype conditioning, cross-scale fusion, MoE, and hyperparameter choices. Our code is publicly available at https://github.com/AIPMLab/Gazeformer.
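To make the "routed/shared Mixture of Experts" idea concrete, here is a hedged sketch of one such feed-forward block in plain Python. The abstract does not specify the routing rule, so the top-k softmax gating, the ReLU activation, the layer sizes, and every name below are assumptions chosen for illustration, not the authors' implementation. The shared expert runs on every token, while the router picks a few routed experts per token and mixes their outputs.

```python
import math
import random

def linear(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def relu(v):
    return [max(0.0, a) for a in v]

def make_expert(rng, dim, hidden):
    """Create a two-layer FFN expert with small random weights."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    return w1, w2

def expert_forward(expert, x):
    w1, w2 = expert
    return linear(relu(linear(x, w1)), w2)

def moe_ffn(x, router_w, routed, shared, top_k=2):
    """Apply a shared expert plus the top_k routed experts to one token x.

    Assumed gating: softmax over the logits of the selected experts only.
    Returns the output vector and the indices of the chosen experts.
    """
    logits = linear(x, router_w)  # one score per routed expert
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = order[:top_k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    out = expert_forward(shared, x)  # shared expert is always active
    for i in top:
        gate = exps[i] / z
        y = expert_forward(routed[i], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

# Usage with hypothetical sizes: 4-dim tokens, 4 routed experts, 1 shared.
rng = random.Random(0)
dim, hidden, n_experts = 4, 8, 4
routed = [make_expert(rng, dim, hidden) for _ in range(n_experts)]
shared = make_expert(rng, dim, hidden)
router_w = [[rng.gauss(0, 0.1) for _ in range(n_experts)] for _ in range(dim)]
out, chosen = moe_ffn([1.0, -0.5, 0.25, 2.0], router_w, routed, shared)
```

Because only top_k routed experts run per token, total parameter count can grow with the number of experts while per-token compute stays roughly fixed, which is one common reading of "conditional capacity".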

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Hand Gesture Recognition Systems