Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan

TL;DR
This paper evaluates the reasoning abilities of large language models using the New York Times Connections word game, revealing significant gaps compared to human players and establishing it as a challenging AI benchmark.
Contribution
It introduces a new benchmark based on the Connections game to assess LLMs' abstract reasoning and provides a taxonomy of knowledge types involved in the task.
Findings
Claude 3.5 Sonnet fully solves only 18% of games
Human players, especially experts, outperform LLMs
LLMs struggle with encyclopedic and multiword knowledge
Abstract
The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game. We find that while LLMs perform relatively well on categorizing words based on semantic relations, they struggle with other types of knowledge such as Encyclopedic Knowledge, Multiword…
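To make the "fully solve" criterion concrete, here is a minimal sketch of how a model's answer to one game might be scored. It assumes the standard Connections layout of 16 words split into four groups of four, and it treats a game as fully solved only when all four predicted groups match the reference groups exactly. The function names and the example puzzle are hypothetical illustrations and are not taken from the paper, whose own scoring may differ.

```python
from typing import Iterable, List, Set, FrozenSet


def normalize(groups: Iterable[Iterable[str]]) -> Set[FrozenSet[str]]:
    """Turn a list of word groups into a set of case-insensitive frozensets."""
    return {frozenset(word.strip().upper() for word in group) for group in groups}


def count_correct_groups(predicted, gold) -> int:
    """Number of predicted groups that exactly match a reference group."""
    return len(normalize(predicted) & normalize(gold))


def is_fully_solved(predicted, gold) -> bool:
    """True only if every reference group was recovered exactly."""
    return normalize(predicted) == normalize(gold)


def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of games fully solved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical puzzle (not from the dataset): PIKE and PINK are easy to swap.
gold = [
    ["BASS", "TROUT", "SALMON", "PIKE"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["MARS", "VENUS", "SATURN", "EARTH"],
    ["APPLE", "PEAR", "PLUM", "FIG"],
]
predicted = [
    ["bass", "trout", "salmon", "pink"],  # one wrong word spoils the group
    ["red", "blue", "green", "pike"],
    ["mars", "venus", "saturn", "earth"],
    ["apple", "pear", "plum", "fig"],
]

print(count_correct_groups(predicted, gold))  # 2
print(is_fully_solved(predicted, gold))       # False
```

Under this all-or-nothing criterion, a single misplaced word costs two groups and the whole game, which is one reason a full-solve rate can be much lower than per-group accuracy.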
Taxonomy
Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations
