NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers

Angel Yahir Loredo Lopez, Tyler McDonald, and Ali Emami

arXiv:2412.01621·cs.CL·February 26, 2025

TL;DR

NYT-Connections is a new benchmark of simple word puzzles designed to challenge LLMs' reasoning skills beyond quick intuition, revealing significant performance gaps compared to humans and highlighting the limits of advanced prompting techniques.

Contribution

The paper introduces NYT-Connections, a novel reasoning benchmark that isolates fundamental skills and evaluates LLMs against humans across multiple configurations.

Findings

01. Even top-performing LLMs such as GPT-4 fall short of human performance by nearly 30% on the benchmark.

02. Advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases.

03. The benchmark resists intuitive shortcuts and is regularly updated to prevent data leakage.

Abstract

Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases.…
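
To make the single-attempt setting concrete, the Python sketch below scores a proposed grouping against an answer key and adds a majority vote in the spirit of Self-Consistency. It assumes the standard Connections layout of 16 words in four groups of four. The function names, the all-or-nothing scoring rule, and the example puzzle are hypothetical illustrations, not the authors' evaluation code and not data from the benchmark.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Set

Group = FrozenSet[str]

# Invented example puzzle (NOT taken from the NYT-Connections dataset).
GOLD: List[List[str]] = [
    ["apple", "pear", "plum", "fig"],
    ["red", "blue", "green", "pink"],
    ["mars", "venus", "earth", "saturn"],
    ["saw", "drill", "hammer", "wrench"],
]


def normalize(groups: Iterable[Iterable[str]]) -> Set[Group]:
    """Convert word groups to a set of frozensets so order and case are ignored."""
    return {frozenset(word.strip().lower() for word in group) for group in groups}


def count_correct_groups(predicted, gold) -> int:
    """Number of predicted groups that exactly match a gold group."""
    return len(normalize(predicted) & normalize(gold))


def solved_in_one_attempt(predicted, gold) -> bool:
    """Assumed single-attempt rule: every group must be exactly right."""
    return normalize(predicted) == normalize(gold)


def self_consistency_vote(samples: List[List[List[str]]]) -> Set[Group]:
    """Return the complete grouping proposed most often across sampled answers."""
    if not samples:
        raise ValueError("need at least one sampled grouping")
    votes = Counter(frozenset(normalize(sample)) for sample in samples)
    winner, _ = votes.most_common(1)[0]
    return set(winner)
```

Under this assumed rule, a grouping with a single pair of swapped words breaks two groups at once and fails the puzzle, which is one way a plausible but hasty first guess can be penalized.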

Code & Models

Datasets

tm21cy/NYT-Connections
dataset · 269 dl

Taxonomy

Topics: Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies · Topic Modeling

Methods: Attention Is All You Need · Adam · Position-Wise Feed-Forward Layer · Linear Layer · Softmax · Multi-Head Attention · Byte Pair Encoding · Label Smoothing · Dropout · Dense Connections