Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay
Gonçalo Hora de Carvalho, Oscar Knap, Robert Pollice

TL;DR
This paper introduces ChildPlay, a benchmark suite to evaluate large language models' abilities beyond text, focusing on strategic, spatial, and chemical reasoning through simple ASCII-encoded games, revealing limited generalization.
Contribution
The paper presents a new benchmark set and evaluation methodology for assessing GPT models on non-linguistic tasks like strategy and spatial reasoning, highlighting their limitations.
Findings
GPT models perform poorly on strategic and spatial tasks.
Performance improves with larger models on some tasks.
Models struggle with chemistry-related ASCII graph interpretation.
Abstract
We developed a benchmark set to assess the generalization of state-of-the-art large language models on problems beyond linguistic tasks and evaluate it on a systematic progression of GPT models (GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini). Using simple games like Tic-Tac-Toe, Connect Four, Battleship, and a Shape Recognition Game, all encoded in ASCII, we test strategic capabilities and spatial reasoning, core abilities any artificial intelligence would need to master for solving problems in chemistry. To probe generalization, we introduce two new games for spatial logic: LEGO Connect Language (LCL) and Guess-the-SMILES (GtS), an operationally simple chemistry benchmark. Our results show that GPT models provide meaningful responses for several tasks but, generally, perform poorly. A systematic performance progression with increased model capabilities (GPT-3.5, GPT-4, GPT-4o) is only observed…
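To make the idea of an ASCII-encoded game concrete, here is a minimal sketch of how a Tic-Tac-Toe position could be rendered as text for a model and how a returned move could be checked. The board layout, the function names (`render_board`, `is_legal_move`, `winner`), and the (row, column) move format are all illustrative assumptions; the abstract does not specify the encoding the benchmark actually uses.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each cell is "X", "O", or " " (empty).
Board = List[List[str]]


def render_board(board: Board) -> str:
    """Render a 3x3 board as plain ASCII text (assumed layout, not the paper's)."""
    rows = [" | ".join(row) for row in board]
    return "\n---------\n".join(rows)


def is_legal_move(board: Board, move: Tuple[int, int]) -> bool:
    """A move is legal if it is on the board and targets an empty cell."""
    r, c = move
    return 0 <= r < 3 and 0 <= c < 3 and board[r][c] == " "


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if a player has three in a line, else None."""
    lines = [list(row) for row in board]                              # rows
    lines += [[board[r][c] for r in range(3)] for c in range(3)]      # columns
    lines.append([board[i][i] for i in range(3)])                     # main diagonal
    lines.append([board[i][2 - i] for i in range(3)])                 # anti-diagonal
    for line in lines:
        if line[0] != " " and line.count(line[0]) == 3:
            return line[0]
    return None


if __name__ == "__main__":
    board = [
        ["X", "O", " "],
        [" ", "X", " "],
        ["O", " ", "X"],
    ]
    print(render_board(board))
    print("winner:", winner(board))
```

A harness built along these lines would send the rendered text to the model, parse its reply into a move, and use a legality check to count invalid moves separately from lost games.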
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Sparse Evolutionary Training · Cosine Annealing · Label Smoothing · Linear Layer · Weight Decay · Softmax · Position-Wise Feed-Forward Layer
