Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction
Jing Liu, Abdellah Fourtassi

TL;DR
This study evaluates how well large language models can mimic child-caregiver interactions. The models approximate human dialogues at the word and utterance level, but they fall short in reproducing discursive patterns and human-level diversity. The work is intended as a first step toward a comprehensive benchmark for LLMs in child-oriented applications.
Contribution
Introduces a benchmark for assessing LLMs' ability to simulate child-caregiver language, highlighting current models' strengths and gaps in interaction fidelity.
Findings
LLMs approximate dialogues at word and utterance levels
Models struggle with discursive patterns and diversity
State-of-the-art models exaggerate alignment
Abstract
LLMs can generate human-like dialogues, yet their ability to simulate early child-adult interactions remains largely unexplored. In this paper, we examined how effectively LLMs can capture the distinctive features of child-caregiver language in interaction, using both static and interactive benchmarking methods. We found that state-of-the-art LLMs like Llama 3 and GPT-4o can approximate child-caregiver dialogues at the word and utterance level, but they struggle to reproduce the child and caregiver's discursive patterns, exaggerate alignment, and fail to reach the level of diversity shown by humans. The broader goal of this work is to initiate the development of a comprehensive benchmark for LLMs in child-oriented applications.
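To make the terms "alignment" and "diversity" in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' actual metrics. It assumes alignment can be proxied by word overlap (Jaccard similarity) between adjacent turns from different speakers, and diversity by a type-token ratio. The function names, the speaker labels `CHI` and `CAR`, and the toy dialogue are all hypothetical.

```python
from typing import List, Tuple

PUNCT = ".,!?;:\"'"

def tokenize(utterance: str) -> List[str]:
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in utterance.split())
    return [t for t in tokens if t]

def lexical_alignment(prev: str, curr: str) -> float:
    """Jaccard overlap between the word sets of two utterances (assumed proxy)."""
    a, b = set(tokenize(prev)), set(tokenize(curr))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_adjacent_alignment(dialogue: List[Tuple[str, str]]) -> float:
    """Average alignment over adjacent turns where the speaker changes."""
    scores = [
        lexical_alignment(dialogue[i - 1][1], dialogue[i][1])
        for i in range(1, len(dialogue))
        if dialogue[i - 1][0] != dialogue[i][0]
    ]
    return sum(scores) / len(scores) if scores else 0.0

def type_token_ratio(utterances: List[str]) -> float:
    """Distinct words divided by total words: a simple diversity proxy."""
    tokens = [t for u in utterances for t in tokenize(u)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical toy dialogue: (speaker, utterance) pairs.
dialogue = [
    ("CHI", "want ball"),
    ("CAR", "you want the ball?"),
    ("CHI", "ball!"),
]
alignment = mean_adjacent_alignment(dialogue)
child_diversity = type_token_ratio([u for s, u in dialogue if s == "CHI"])
```

Under these assumptions, a benchmark would compute the same scores on human and model-generated dialogues and compare the two: a model that "exaggerates alignment" would score higher on overlap than the human reference, and one that lacks diversity would score lower on the type-token ratio.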
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Speech Recognition and Synthesis
Methods: LLaMA
