GMGaze: MoE-Based Context-Aware Gaze Estimation with CLIP and Multiscale Transformer
Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Sarah A. Alkhodair, Reem Kateb

TL;DR
GMGaze introduces a multi-scale transformer with semantic prototype conditioning and Mixture-of-Experts modules for improved gaze estimation across domains.
Contribution
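The paper proposes GMGaze, which fuses CLIP global, CLIP patch, and CNN tokens early rather than late, conditions the global embedding on learned semantic prototypes, and scales capacity with Mixture-of-Experts modules. Together these target the three challenges named in the abstract: late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling.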
Findings
Outperforms previous methods on four public benchmarks.
Achieves state-of-the-art results in cross-domain evaluations.
Demonstrates effective capacity scaling with Mixture-of-Experts modules.
Abstract
Gaze estimation methods commonly use facial appearance to predict the direction of a person's gaze. However, previous studies reveal three major challenges with convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods: late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose, and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch and CNN tokens, are fused at the…
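The abstract excerpt does not say how the prototype banks modulate the global embedding, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the global embedding attends over each of the four banks with scaled dot-product attention, that the four resulting context vectors are averaged, and that the two "complementary" tokens are an additive shift and a multiplicative gate of the original embedding. The names `attend` and `condition_global_token`, the averaging step, and the add/gate pairing are all hypothetical choices made for illustration.

```python
import math
import random

# Factor names taken from the abstract; everything else below is assumed.
FACTORS = ("illumination", "background", "head_pose", "appearance")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, bank):
    """Soft-select prototypes from one bank by scaled dot-product similarity."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in bank])
    return [sum(w * p[i] for w, p in zip(weights, bank))
            for i in range(len(query))]


def condition_global_token(global_emb, banks):
    """Return two context-biased tokens built from one global embedding.

    ASSUMPTION: the per-factor contexts are averaged, then applied as an
    additive shift (first token) and as a tanh-bounded multiplicative
    gate (second token). The paper may do something different.
    """
    contexts = [attend(global_emb, banks[name]) for name in FACTORS]
    dim = len(global_emb)
    ctx = [sum(c[i] for c in contexts) / len(contexts) for i in range(dim)]
    token_add = [g + c for g, c in zip(global_emb, ctx)]
    token_gate = [g * (1.0 + math.tanh(c)) for g, c in zip(global_emb, ctx)]
    return token_add, token_gate


# Toy usage with random stand-ins for the CLIP embedding and learned banks.
rng = random.Random(0)
dim, n_proto = 8, 4
banks = {name: [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_proto)]
         for name in FACTORS}
global_emb = [rng.gauss(0, 1) for _ in range(dim)]
token_a, token_b = condition_global_token(global_emb, banks)
print(len(token_a), len(token_b))
```

In a trained model the banks would be learnable parameters and the embedding would come from the CLIP image encoder. Here both are random placeholders, so only the shapes and data flow are meaningful.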
