Automated Detection of Mutual Gaze and Joint Attention in Dual-Camera Settings via Dual-Stream Transformers

Jakub Kosmydel; Paweł Gajewski; Arkadiusz Białek

arXiv:2604.27105·cs.CV·May 1, 2026

TL;DR

This paper introduces a dual-stream Transformer model that automates the detection of mutual gaze and joint attention in dual-camera settings, aiding developmental psychology research.

Contribution

It presents a novel dual-stream Transformer architecture with gaze-aware backbones and token fusion, outperforming existing methods on caregiver-infant interaction data.

Findings

01 The model significantly outperforms a convolutional baseline.

02 The model surpasses a state-of-the-art multimodal LLM.

03 The model and weights are open-sourced for broader use.

Abstract

Analyzing mutual gaze (MG) and joint attention (JA) is critical in developmental psychology but traditionally relies on labor-intensive manual coding. Automating this process in multi-camera laboratory settings is computationally challenging due to complex cross-camera relational dynamics. In this paper, we propose a highly efficient dual-stream Transformer architecture for detecting MG and JA from synchronized dual-camera recordings. Our approach leverages frozen gaze-aware backbones (GazeLLE) to extract rich visual priors, combined with a custom token fusion mechanism to map the spatial and semantic relationships between interacting dyads. Evaluated on an ecologically valid dataset of caregiver-infant interactions, our model exhibits good performance, significantly outperforming both a convolutional baseline and a state-of-the-art multimodal Large Language Model (LLM). By…
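The abstract does not spell out the token fusion mechanism, so the following is only a minimal sketch of one generic way two camera streams could be fused, not the authors' implementation. It assumes each frozen backbone has already produced a small set of token vectors per camera (random numbers stand in for them here), tags each token with a hypothetical per-camera embedding, runs a single self-attention pass over the concatenated tokens, and feeds the mean-pooled result to two independent sigmoid heads for MG and JA. All names (`fuse_streams`, `classify`, `tag_a`, `tag_b`), the dimensions, and the choice of two binary heads are assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Identity projections are used for queries, keys, and values to keep
    the sketch short; a real model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fuse_streams(tokens_a, tokens_b, tag_a, tag_b):
    """Tag tokens with a per-camera embedding, concatenate, attend, mean-pool."""
    tagged = [[x + t for x, t in zip(tok, tag_a)] for tok in tokens_a]
    tagged += [[x + t for x, t in zip(tok, tag_b)] for tok in tokens_b]
    attended = self_attention(tagged)
    d = len(attended[0])
    return [sum(tok[j] for tok in attended) / len(attended) for j in range(d)]


def classify(pooled, weights, biases):
    """Two independent sigmoid heads (assumed): P(MG) and P(JA)."""
    probs = {}
    for name in ("MG", "JA"):
        z = sum(w * x for w, x in zip(weights[name], pooled)) + biases[name]
        probs[name] = 1.0 / (1.0 + math.exp(-z))
    return probs


def rand_vec(dim):
    """Random stand-in for a backbone token or a learned parameter vector."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


random.seed(0)
DIM, N_TOKENS = 8, 4  # hypothetical sizes, chosen only for the demo

cam_a = [rand_vec(DIM) for _ in range(N_TOKENS)]  # stand-in tokens, camera A
cam_b = [rand_vec(DIM) for _ in range(N_TOKENS)]  # stand-in tokens, camera B
tag_a, tag_b = rand_vec(DIM), rand_vec(DIM)       # hypothetical camera embeddings
weights = {"MG": rand_vec(DIM), "JA": rand_vec(DIM)}
biases = {"MG": 0.0, "JA": 0.0}

pooled = fuse_streams(cam_a, cam_b, tag_a, tag_b)
probs = classify(pooled, weights, biases)
print(probs)
```

Concatenating the two token sets before attention lets every token from one camera attend to every token from the other, which is one simple way to model cross-camera relations; the paper's custom fusion may differ substantially.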
