Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

Syed Zohaib Hassan; Pål Halvorsen; Miriam S. Johnson; Pierre Lison

arXiv:2510.24250 · cs.CL · October 29, 2025

TL;DR

This study evaluates how well five large language models generate authentic, age-appropriate child-like conversations in Norwegian. It finds that the models struggle to produce language suited to children and points to a shortage of suitable training data.

Contribution

The paper provides a comparative analysis of LLMs' ability to produce age-appropriate dialogue for children, emphasizing the need for better training data for child language modeling.

Findings

1. GPT-4 and NorBloom-7b performed relatively well.

2. Models often generated language more advanced than children's speech.

3. High inter-rater reliability in evaluations (ICC=0.75).

Abstract

Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating the ability of five LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and predicted age more accurately for younger children (5-year-olds) than for older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived…
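The inter-rater reliability figure above is an intraclass correlation coefficient (ICC). The abstract does not say which ICC form was used, so the following is only a minimal sketch that assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc2_1` and the toy ratings are hypothetical illustrations, not the authors' code or data.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated sample; each row holds
    one score per rater. (Assumed form; the paper may use another.)
    """
    n = len(ratings)      # number of rated samples
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy data: 6 samples scored by 4 raters (illustrative only)
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(toy_ratings), 2))  # 0.29
```

In this toy example the raters rank the samples similarly but use very different score levels, which pulls the absolute-agreement ICC down. A value of 0.75, as reported in the abstract, would indicate considerably closer agreement.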
