Towards Multimodal Large-Language Models for Parent-Child Interaction: A Focus on Joint Attention

Weiyan Shi; Viet Hai Le; Kenny Tsu Wei Choo

arXiv:2502.19877·cs.HC·May 5, 2025

TL;DR

This paper evaluates how well Multimodal Large Language Models (MLLMs) interpret joint attention in parent-child interactions. It finds that current models fall short, largely because they lack a nuanced understanding of eye contact cues.

Contribution

The paper benchmarks MLLMs on joint attention detection using 26 parent-child interaction videos annotated by two speech-language pathologists, revealing significant gaps in the models' understanding of child-initiated eye contact.

Findings

01

MLLMs struggle with nuanced joint attention cues

02

Current models lack detailed understanding of eye contact

03

Multimodal reasoning needs improvement, particularly in handling detailed eye contact

Abstract

Joint attention is a critical component of early speech-language development and a key indicator of effective parent-child interaction. However, research on detecting and analysing joint attention remains limited, particularly for Multimodal Large Language Models (MLLMs). This study evaluates MLLMs' ability to comprehend joint attention by analysing 26 parent-child interaction videos annotated by two speech-language pathologists. These annotations identify strong and poor joint attention segments, serving as benchmarks for evaluating the models' interpretive capabilities. Our findings reveal that current MLLMs struggle to accurately interpret joint attention due to a lack of nuanced understanding of child-initiated eye contact, a crucial component of joint attention dynamics. This study highlights the importance of incorporating detailed eye contact to enhance MLLMs' multimodal…
