Discovery and usage of joint attention in images
Daniel Harari, Joshua B. Tenenbaum, Shimon Ullman

TL;DR
This paper introduces a novel method for detecting joint visual attention in static images, combining gaze direction and depth information, and demonstrates its potential for improving image understanding and captioning.
Contribution
The work presents the first algorithm to identify joint attention in single images and shows its relevance for understanding social interactions in images.
Findings
The algorithm accurately detects joint attention in static images.
Humans are sensitive to joint attention cues, supporting their use in image analysis.
Detection of joint attention can enhance image captioning and understanding.
Abstract
Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we address the extraction of joint attention events and the use of such events for image descriptions. The work makes two novel contributions. First, our extraction algorithm is the first to identify joint visual attention in single static images. It computes 3D gaze direction, identifies the gaze target by combining gaze direction with a 3D depth map computed for the image, and identifies the common gaze target. Second, we use a human study to demonstrate the sensitivity of humans to joint attention, suggesting that the detection of such a configuration in an…
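The three steps named in the abstract (3D gaze direction, gaze target from a depth map, common gaze target) can be illustrated with a minimal geometric sketch. This is not the authors' implementation: the function names, the pinhole camera intrinsics, and the tolerance values below are all assumptions made for illustration, and the gaze directions are taken as given inputs rather than estimated from the image.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert an HxW depth map into an (H*W, 3) array of 3D points.

    Assumes a simple pinhole camera with hypothetical intrinsics fx, fy, cx, cy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def gaze_target(origin, direction, points, tol=0.1, min_t=0.2):
    """Return the first scene point lying close to the gaze ray, or None.

    origin, direction: 3D head position and gaze direction (assumed given).
    tol: maximum perpendicular distance from the ray (assumed value).
    min_t: minimum distance along the ray, to skip the gazer's own head.
    """
    d = direction / np.linalg.norm(direction)
    rel = points - origin
    t = rel @ d                                   # distance along the ray
    perp = np.linalg.norm(rel - np.outer(t, d), axis=1)
    hit = (t > min_t) & (perp < tol)
    if not hit.any():
        return None
    idx = np.flatnonzero(hit)
    return points[idx[np.argmin(t[idx])]]         # nearest hit along the ray

def joint_attention(targets, radius=0.2):
    """True if at least two gaze targets exist and all lie near their centroid."""
    if len(targets) < 2 or any(t is None for t in targets):
        return False
    pts = np.asarray(targets)
    dists = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    return bool(np.all(dists <= radius))
```

In this sketch, two people are counted as sharing attention when their individual gaze rays land on scene points that lie within `radius` of each other; a person whose ray hits nothing yields `None`, which rules out joint attention for that group.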
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition
