TL;DR
This paper introduces ChildAgentEval, a benchmark inspired by the Wechsler Intelligence Scale for Children (WISC), to assess how closely MLLM-based AI agents match human cognitive abilities at specific ages.
Contribution
It presents the first psychometrically grounded benchmark for evaluating cognitive age alignment in multimodal large language model-based AI agents.
Findings
Reveals gaps in AI agents' ability to simulate age-specific cognition
Provides a systematic comparison of AI reasoning with human developmental stages
Highlights areas where AI agents need improvement to match human cognitive levels
Abstract
While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate…
