Towards Multimodal Large-Language Models for Parent-Child Interaction: A Focus on Joint Attention
Weiyan Shi, Viet Hai Le, Kenny Tsu Wei Choo

TL;DR
This paper evaluates the ability of Multimodal Large Language Models to interpret joint attention in parent-child interactions, highlighting current limitations and emphasizing the need for improved understanding of eye contact cues.
Contribution
The paper provides a benchmark analysis of MLLMs' performance on joint attention detection using parent-child interaction videos annotated by speech-language pathologists, revealing significant gaps in understanding child-initiated eye contact.
Findings
MLLMs struggle with nuanced joint attention cues
Current models lack a nuanced understanding of child-initiated eye contact
Multimodal reasoning needs to incorporate detailed eye contact cues
Abstract
Joint attention is a critical component of early speech-language development and a key indicator of effective parent-child interaction. However, research on detecting and analysing joint attention remains limited, particularly for Multimodal Large Language Models (MLLMs). This study evaluates MLLMs' ability to comprehend joint attention by analysing 26 parent-child interaction videos annotated by two speech-language pathologists. These annotations identify strong and poor joint attention segments, serving as benchmarks for evaluating the models' interpretive capabilities. Our findings reveal that current MLLMs struggle to accurately interpret joint attention due to a lack of nuanced understanding of child-initiated eye contact, a crucial component of joint attention dynamics. This study highlights the importance of incorporating detailed eye contact to enhance MLLMs' multimodal…
