Joint Gaze-Location and Gaze-Object Detection
Danyang Tu, Wei Shen, Wei Sun, Xiongkuo Min, Guangtao Zhai

TL;DR
This paper introduces GTR, a unified transformer-based model that jointly detects human gaze locations and gaze objects in a single end-to-end pipeline, significantly improving accuracy and efficiency over multi-stage methods.
Contribution
The paper presents GTR, the first unified transformer model for joint gaze-location and gaze-object detection, streamlining gaze following detection into a single-stage, end-to-end framework.
Findings
Achieves a 12.1 mAP gain on the GazeFollowing dataset.
Achieves an 18.2 mAP gain on VideoAttentionTarget.
Increases FPS by more than 9×, with the advantage most pronounced when multiple people are in the image.
Abstract
This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework where human head crops must first be detected and then fed into a subsequent GL-D sub-network, which is further followed by an additional object detector for GO-D. In contrast, we reframe the gaze following detection task as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze location and gaze object in a unified, single-stage pipeline. To this end, we propose GTR, short for Gaze following detection TRansformer, streamlining the gaze following detection pipeline by eliminating all additional components, leading to the first unified paradigm…
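The structural difference between the two pipelines can be illustrated with a small counting sketch. This is not the authors' code: it only counts network invocations per image, under the assumption that the multi-stage pipeline runs one head detector pass, one GL-D pass per detected head crop, and one object detector pass, while a unified single-stage model runs once regardless of how many people appear. The names `PassCount`, `multi_stage_passes`, and `single_stage_passes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassCount:
    """Network invocations needed to process one image (illustrative only)."""
    head_detector: int = 0
    gaze_location: int = 0
    object_detector: int = 0
    unified: int = 0

    @property
    def total(self) -> int:
        return (self.head_detector + self.gaze_location
                + self.object_detector + self.unified)


def multi_stage_passes(num_people: int) -> PassCount:
    # Assumption: heads are detected once, each head crop is fed to the
    # GL-D sub-network separately, and an object detector runs once for GO-D.
    return PassCount(head_detector=1,
                     gaze_location=num_people,
                     object_detector=1)


def single_stage_passes(num_people: int) -> PassCount:
    # Assumption: one forward pass covers every person in the image.
    return PassCount(unified=1)


if __name__ == "__main__":
    for n in (1, 5, 10):
        multi = multi_stage_passes(n).total
        single = single_stage_passes(n).total
        print(f"people={n}: multi-stage={multi} passes, single-stage={single} pass")
```

Pass counts are only a proxy for cost: individual passes differ in runtime, so this sketch shows why a sequential pipeline slows down as more people appear but does not reproduce the reported FPS figures.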
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Brain Tumor Detection and Classification · Advanced Neural Network Applications
