Human-AI Alignment of Multimodal Large Language Models with Speech-Language Pathologists in Parent-Child Interactions
Weiyan Shi, Kenny Tsu Wei Choo

TL;DR
This paper shows how multimodal large language models (MLLMs) can be aligned with the reasoning of speech-language pathologists (SLPs) to interpret parent-child interactions. Using expert-informed prompts and structured behavioural cues to assess joint attention, the models reach up to 85% accuracy in cue extraction and over 75% average precision in simulating expert judgement.
Contribution
It introduces a two-stage MLLM system aligned with SLP reasoning, providing a novel approach to automated behavioural assessment in parent-child interactions.
Findings
MLLMs achieved up to 85% accuracy in cue extraction
Over 75% average precision in expert judgement simulation
Guidelines for designing SLP-aligned MLLM systems
Abstract
Joint attention is a critical marker of early social-communicative development, yet remains difficult for caregivers to assess without expert guidance. In this work, we explore how multimodal large language models (MLLMs) can be aligned with the reasoning processes of speech-language pathologists (SLPs) to support the interpretation of everyday parent-child interactions. We conducted in-depth interviews and video annotation studies with three experienced SLPs to uncover how they evaluate joint attention based on three core behavioural cues: gaze, action, and vocalisation. Using these insights, we developed a two-stage MLLM-based system that first extracts fine-grained behavioural descriptions from video segments and then judges joint attention quality using expert-aligned prompts. Our evaluation across 26 parent-child interaction videos shows that MLLMs can achieve up to 85% accuracy in…
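The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ModelFn`, `CueDescription`, `extract_cues`, `judge_joint_attention`, and `assess_segment` are hypothetical, the prompt wording is invented for illustration, and a plain segment identifier stands in for real video input. The only elements taken from the source are the three cues (gaze, action, vocalisation) and the order of the two stages.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to model text.
# A real system would pass video frames and audio to an MLLM instead.
ModelFn = Callable[[str], str]

# The three behavioural cues named in the abstract.
CUES = ("gaze", "action", "vocalisation")


@dataclass
class CueDescription:
    """Stage 1 output: one free-text description per behavioural cue."""
    gaze: str
    action: str
    vocalisation: str


def extract_cues(segment_id: str, model: ModelFn) -> CueDescription:
    """Stage 1: request a description of each cue for one video segment."""
    answers = {
        cue: model(f"Describe the {cue} of parent and child in segment {segment_id}.")
        for cue in CUES
    }
    return CueDescription(**answers)


def judge_joint_attention(cues: CueDescription, model: ModelFn) -> str:
    """Stage 2: judge joint attention quality from the stage 1 descriptions."""
    prompt = (
        "Given these observations, rate the quality of joint attention.\n"
        f"Gaze: {cues.gaze}\n"
        f"Action: {cues.action}\n"
        f"Vocalisation: {cues.vocalisation}"
    )
    return model(prompt).strip()


def assess_segment(segment_id: str, model: ModelFn) -> str:
    """Run both stages in sequence for one segment."""
    return judge_joint_attention(extract_cues(segment_id, model), model)
```

Keeping the stages separate means the intermediate cue descriptions can be inspected or compared against expert annotations before any judgement is made, which is one plausible reason for a two-stage split.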
Taxonomy
Topics: Language Development and Disorders · Child and Animal Learning Development · Action Observation and Synchronization
