MGTR: End-to-End Mutual Gaze Detection with Transformer
Hang Guo, Zhengxi Hu, Jingtai Liu

TL;DR
MGTR introduces an end-to-end transformer-based framework for mutual gaze detection, improving speed and maintaining accuracy by jointly detecting heads and inferring gaze relationships in a single process.
Contribution
The paper presents a novel one-stage transformer-based approach for mutual gaze detection, streamlining the process and enhancing efficiency over traditional two-stage methods.
Findings
Accelerates mutual gaze detection without performance loss
Effectively captures semantic information at multiple levels
Evaluated on two mutual gaze datasets, where it is faster than two-stage methods at comparable accuracy
Abstract
People looking at each other, or mutual gaze, is ubiquitous in daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods are two-stage: their inference speed is limited by the two-stage pipeline, and performance in the second stage depends on the first. In this paper, we propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer, or MGTR, to perform mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer mutual gaze relationships based on global image information, which simplifies the whole process. Experimental results on two mutual gaze datasets show that our method is able to accelerate the mutual gaze detection process…
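The abstract names "mutual gaze instance triples" without spelling out their contents. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes a triple holds two head bounding boxes and one mutual gaze score, and every name in it (`GazeTriple`, `keep_mutual`, `candidate_pairs`, the box layout, the 0.5 threshold) is hypothetical. It also shows why a two-stage pipeline has extra work to do: after detecting heads, it must classify every pair.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max). Not specified in the source.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GazeTriple:
    """Hypothetical triple: two head boxes plus a mutual gaze score."""
    head_a: Box
    head_b: Box
    mutual_score: float  # assumed to lie in [0, 1]


def keep_mutual(triples: List[GazeTriple], threshold: float = 0.5) -> List[GazeTriple]:
    """Keep triples whose score reaches the (assumed) decision threshold."""
    return [t for t in triples if t.mutual_score >= threshold]


def candidate_pairs(heads: List[Box]) -> List[Tuple[Box, Box]]:
    """Unordered head pairs a two-stage pipeline would classify in stage two."""
    return list(combinations(heads, 2))
```

With n detected heads, `candidate_pairs` returns n(n-1)/2 pairs, which is the second-stage workload that a one-stage model predicting triples directly would avoid.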
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Indoor and Outdoor Localization Technologies · Hand Gesture Recognition Systems
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
