Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao, Alexandros Graikos, Jingwei Zhang, Sounak Mondal, Minh Hoai, Dimitris Samaras

TL;DR
This paper introduces a semi-supervised approach for gaze following that leverages a VQA model and a diffusion model to generate refined pseudo-annotations, significantly reducing the need for manual labeling.
Contribution
The authors propose a novel semi-supervised method combining VQA-based heatmaps and diffusion model refinement for gaze annotation, outperforming baselines and halving annotation requirements.
Findings
Outperforms simple pseudo-annotation baselines on the GazeFollow dataset.
Reduces annotation effort by 50% when applied to the VAT model.
Achieves state-of-the-art results on the VideoAttentionTarget dataset.
Abstract
Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by 'prompting' the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the…
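Illustrative sketch (not the authors' code): the two ingredients named in the abstract can be written down in a few lines of NumPy. The first function applies the standard Grad-CAM weighting to a set of feature maps and their gradients. The second applies the standard DDPM forward-noising step to a heatmap. Starting reverse sampling from a partially noised heatmap at an intermediate step `t0` is one common way to trade off an initial estimate against a learned prior; whether this matches the paper's modified reverse sampling is an assumption. The function names, array shapes, noise schedule, and `t0` are all hypothetical, and the denoising network is omitted.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from feature maps and their gradients, both shaped (K, H, W)."""
    # Channel weights: global-average-pool the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))  # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep only positive evidence.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # shape (H, W)
    # Scale to [0, 1]; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def noise_to_timestep(heatmap, t0, betas, rng):
    """DDPM forward process: draw x_{t0} given x_0 = heatmap."""
    # Cumulative product of (1 - beta) up to and including step t0.
    alpha_bar = np.cumprod(1.0 - betas)[t0]
    eps = rng.standard_normal(heatmap.shape)
    return np.sqrt(alpha_bar) * heatmap + np.sqrt(1.0 - alpha_bar) * eps

# Hypothetical usage with random stand-ins for VQA features and gradients.
rng = np.random.default_rng(0)
cam = grad_cam(rng.random((4, 8, 8)), rng.standard_normal((4, 8, 8)))
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule
x_t0 = noise_to_timestep(cam, t0=300, betas=betas, rng=rng)
```

In this sketch a larger `t0` discards more of the initial heatmap, so a subsequent reverse process would lean more on the prior learned from human annotations, while a smaller `t0` keeps the result closer to the VQA heatmap.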
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Diffusion · Heatmap
