Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

Qiaomu Miao; Alexandros Graikos; Jingwei Zhang; Sounak Mondal; Minh Hoai; Dimitris Samaras

arXiv:2406.02774 · cs.CV · July 19, 2024

TL;DR

This paper introduces a semi-supervised approach for gaze following that leverages a VQA model and a diffusion model to generate refined pseudo-annotations, significantly reducing the need for manual labeling.

Contribution

The authors propose a semi-supervised method that generates gaze pseudo-annotations from VQA-based Grad-CAM heatmaps and refines them with a diffusion model. It outperforms simple pseudo-annotation baselines on GazeFollow and halves the annotation needed by the VAT model.

Findings

01 Outperforms simple pseudo-annotation baselines on the GazeFollow dataset.

02 Reduces annotation effort by 50% when applied to the VAT model.

03 Achieves state-of-the-art results on the VideoAttentionTarget dataset.

Abstract

Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by "prompting" the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the…
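The two priors described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. `grad_cam` follows the standard Grad-CAM recipe (channel weights are spatially averaged gradients, followed by a ReLU over the weighted sum of feature maps). `noise_to_step` is an assumption about how the trade-off might be exposed: it applies the standard forward diffusion step to a heatmap, so that a later start step `t` discards more of the VQA heatmap before a trained denoiser (not shown) takes over. The function names, the linear noise schedule, and the array shapes are all hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Standard Grad-CAM on one layer.

    activations, gradients: arrays of shape (K, H, W), where gradients holds
    d(score)/d(activations) for the answer score of interest.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel importance: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noise_to_step(heatmap, t, alpha_bar, rng):
    """Forward diffusion q(x_t | x_0) applied to a heatmap.

    Assumption: refinement starts the reverse process from this noised
    heatmap. A small t keeps most of the VQA prior; a large t leaves more
    to the diffusion model trained on human annotations.
    """
    eps = rng.standard_normal(heatmap.shape)
    return np.sqrt(alpha_bar[t]) * heatmap + np.sqrt(1.0 - alpha_bar[t]) * eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.random((8, 16, 16))             # stand-in for VQA feature maps
    grads = rng.standard_normal((8, 16, 16))   # stand-in for their gradients
    heatmap = grad_cam(acts, grads)
    alpha_bar = linear_alpha_bar(1000)
    for t in (50, 500, 950):
        noisy = noise_to_step(heatmap, t, alpha_bar, rng)
        print(t, round(float(np.sqrt(alpha_bar[t])), 3), noisy.shape)
```

In this sketch the printed value `sqrt(alpha_bar[t])` is the fraction of the original heatmap signal that survives at step `t`, which is one way to read the trade-off between the two priors.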

Code & Models

Repositories

cvlab-stonybrook/gcdr-gaze
pytorch

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Human Pose and Action Recognition · Hand Gesture Recognition Systems

Methods: Diffusion · Heatmap