Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Prisha Samadarshi; Mariam Mustafa; Anushka Kulkarni; Raven Rothkopf; Tuhin Chakrabarty; Smaranda Muresan

arXiv:2406.11012 · cs.CL · October 15, 2024

TL;DR

This paper evaluates the reasoning abilities of large language models using the New York Times Connections word game, revealing significant gaps compared to human players and establishing it as a challenging AI benchmark.

Contribution

It introduces a new benchmark based on the Connections game to assess LLMs' abstract reasoning and provides a taxonomy of knowledge types involved in the task.

Findings

01. Claude 3.5 Sonnet, the best-performing LLM, fully solves only 18% of the games.

02. Human players outperform the LLMs, with expert players doing so by a significant margin.

03. LLMs struggle with encyclopedic and multiword knowledge.

Abstract

The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game. We find that while LLMs perform relatively well on categorizing words based on semantic relations, they struggle with other types of knowledge such as Encyclopedic Knowledge, Multiword…
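To make the "fully solve" figure concrete, here is a minimal Python sketch of how a proposed grouping could be scored against a puzzle's answer key. It assumes the game's usual format of 16 words split into four groups of four, which this page does not spell out. The function names and the sample puzzle are hypothetical, and the paper's own scoring procedure may differ.

```python
from typing import Iterable


def normalize(groups: Iterable[Iterable[str]]) -> set[frozenset[str]]:
    """Convert groups of words into a set of case-insensitive frozensets,
    so that neither group order nor word order affects the comparison."""
    return {frozenset(word.strip().upper() for word in group) for group in groups}


def count_correct_groups(predicted, gold) -> int:
    """Number of predicted groups that exactly match a gold group."""
    return len(normalize(predicted) & normalize(gold))


def is_fully_solved(predicted, gold) -> bool:
    """A game counts as fully solved only if every group matches."""
    return normalize(predicted) == normalize(gold)


def full_solve_rate(games) -> float:
    """Fraction of (predicted, gold) pairs that are fully solved."""
    games = list(games)
    if not games:
        return 0.0
    return sum(is_fully_solved(p, g) for p, g in games) / len(games)


# Made-up puzzle for illustration only (not taken from the dataset).
GOLD = [
    ["BASS", "TROUT", "SALMON", "PIKE"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["OAK", "ELM", "PINE", "FIR"],
    ["DOG", "CAT", "COW", "PIG"],
]

# A plausible mistake: PIKE and PINE are swapped between two groups.
PREDICTED = [
    ["BASS", "TROUT", "SALMON", "PINE"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["OAK", "ELM", "PIKE", "FIR"],
    ["DOG", "CAT", "COW", "PIG"],
]
```

Under this all-or-nothing criterion, a single swapped pair of words breaks two groups and the game no longer counts as solved, which is one reason a full-solve rate can be low even when many individual groups are correct.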

Code & Models

Repositories

mustafamariam/llm-connections-solver · Official

Datasets

eric27n/NYT-Connections · dataset · 38 downloads

Videos

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Taxonomy

Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations