Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay

Gonçalo Hora de Carvalho; Oscar Knap; Robert Pollice

arXiv:2407.11068·cs.AI·March 3, 2025

TL;DR

This paper introduces ChildPlay, a benchmark suite to evaluate large language models' abilities beyond text, focusing on strategic, spatial, and chemical reasoning through simple ASCII-encoded games, revealing limited generalization.

Contribution

The paper presents a new benchmark set and evaluation methodology for assessing GPT models on non-linguistic tasks like strategy and spatial reasoning, highlighting their limitations.

Findings

01. GPT models perform poorly on strategic and spatial tasks.
02. Performance improves with larger models on some tasks.
03. Models struggle with chemistry-related ASCII graph interpretation.

Abstract

We developed a benchmark set to assess the generalization of state-of-the-art large language models on problems beyond linguistic tasks and evaluate it on a systematic progression of GPT models (GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini). Using simple games like Tic-Tac-Toe, Connect Four, Battleship, and a Shape Recognition Game, all encoded in ASCII, we test strategic capabilities and spatial reasoning, core abilities any artificial intelligence would need to master for solving problems in chemistry. To probe generalization, we introduce two new games for spatial logic: LEGO Connect Language (LCL) and Guess-the-SMILES (GtS), an operationally simple chemistry benchmark. Our results show that GPT models provide meaningful responses for several tasks but, generally, perform poorly. A systematic performance progression with increased model capabilities (GPT-3.5, GPT-4, GPT-4o) is only observed…
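To make "encoded in ASCII" concrete, here is a minimal sketch of how a Tic-Tac-Toe position might be rendered as plain text for a model prompt. The layout and the function names `render_board` and `legal_moves` are assumptions made for illustration; the paper's actual encoding is not shown on this page and may differ.

```python
# Hypothetical illustration only: this is not the ChildPlay repository's encoding.

def render_board(cells):
    """Render a 3x3 board as ASCII text.

    cells is a list of 9 one-character strings, each 'X', 'O', or ' ',
    listed row by row from the top-left corner.
    """
    rows = [" | ".join(cells[i:i + 3]) for i in range(0, 9, 3)]
    return "\n--+---+--\n".join(rows)


def legal_moves(cells):
    """Return the indices (0-8) of the empty squares."""
    return [i for i, c in enumerate(cells) if c == " "]


board = ["X", "O", " ",
         " ", "X", " ",
         " ", " ", "O"]
print(render_board(board))
print("Legal moves:", legal_moves(board))
```

A harness along these lines would pass the rendered string to the model and check the returned move against `legal_moves`, which is one plausible way to separate rule-following errors from strategic ones.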

Code & Models

Repositories

child-play-neurips/child-play (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Sparse Evolutionary Training · Cosine Annealing · Label Smoothing · Linear Layer · Weight Decay · Softmax · Position-Wise Feed-Forward Layer