TL;DR
LISA is a novel framework that combines frequency-domain priors with vision-language knowledge to improve driver gaze estimation robustness against lighting changes and noise.
Contribution
It introduces a dual-domain fusion mechanism and a training strategy to disentangle gaze features from appearance interference, enhancing accuracy and robustness.
Findings
Achieves state-of-the-art performance on two benchmarks.
Significantly improves robustness against occlusions and lighting variations.
Effectively separates gaze features from appearance interference.
Abstract
Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder…
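The abstract's observation about the amplitude spectrum can be illustrated with a small NumPy sketch. This is not the LISA implementation: the function names `split_amplitude_phase` and `low_high_split`, the circular mask, the radius of 4, and the random 32x32 test image are all assumptions chosen for illustration. It uses a circular shift as one simple spatial perturbation, for which the amplitude spectrum is unchanged while the phase is not, and then separates an image into low- and high-frequency parts of the kind a dual-domain model might fuse.

```python
import numpy as np


def split_amplitude_phase(image):
    """Return the amplitude and phase of the 2D Fourier spectrum."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


def low_high_split(image, radius):
    """Split an image into low- and high-frequency components.

    A circular mask (hypothetical choice) around the centre of the
    shifted spectrum selects the low frequencies; the rest is "high".
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~mask)).real
    return low, high


# Toy stand-in for a grayscale eye crop (assumption: random data).
rng = np.random.default_rng(0)
image = rng.random((32, 32))

# A circular shift acts as a simple spatial perturbation.
shifted = np.roll(image, shift=(3, 5), axis=(0, 1))

amp, phase = split_amplitude_phase(image)
amp_shifted, phase_shifted = split_amplitude_phase(shifted)

# Amplitude is unchanged by the shift; only the phase moves.
amplitude_stable = np.allclose(amp, amp_shifted)
phase_changed = not np.allclose(phase, phase_shifted)

# The two bands are complementary and sum back to the input.
low, high = low_high_split(image, radius=4)
```

Real perturbations such as lighting changes or sensor noise do not leave the amplitude exactly unchanged, which is consistent with the abstract's wording of "relatively stable"; the shift case is only the cleanest one to demonstrate.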
