A Knapsack by Any Other Name: Presentation impacts LLM performance on NP-hard problems
Alex Duchnowski, Ellie Pavlick, Alexander Koller

TL;DR
This paper introduces EHOP, a dataset of NP-hard problems expressed in natural language. It shows that LLMs perform better on textbook problems than on real-life or inverted variants, which points to their dependence on training data and their limited robustness.
Contribution
The paper presents EHOP, a novel dataset of diverse NP-hard problems in natural language, and demonstrates how presentation affects LLM performance and reasoning robustness.
Findings
LLMs perform better on textbook problems than on real-life or inverted variants.
Reasoning models show high variance across different problem presentations.
LLMs depend heavily on training data and struggle with generalization.
Abstract
To investigate the effect of problem presentation on LLMs' ability to solve optimization problems, we introduce the dataset of Everyday Hard Optimization Problems (EHOP), a collection of NP-hard problems expressed in natural language. EHOP includes problem formulations that could be found in computer science textbooks (e.g., graph coloring), versions that are dressed up as problems that could arise in real life (e.g., party planning), and variants with inverted rules. We find that state-of-the-art LLMs, across multiple prompting strategies, systematically solve textbook problems more accurately than their real-life and inverted counterparts. While reasoning models are more capable, they nonetheless show high variance across problem presentations, suggesting they lack a truly robust reasoning mechanism. We argue that this constitutes evidence that LLMs are still heavily dependent on what…
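To make the three presentation types concrete, here is a minimal Python sketch of one graph coloring instance shown as a textbook problem and as a party-planning problem, plus a checker for a candidate solution. It is an illustration only and is not taken from EHOP: the function names, the prompt wording, and the reading of "inverted rules" (non-adjacent nodes must differ, instead of adjacent ones) are all assumptions made for this example.

```python
from itertools import combinations


def is_valid_coloring(n, edges, coloring, inverted=False):
    """Check a coloring of nodes 0..n-1.

    Standard rule: adjacent nodes must have different colors.
    Inverted rule (an assumption for this sketch): NON-adjacent nodes
    must have different colors, which is the same as coloring the
    complement graph.
    """
    edge_set = {frozenset(e) for e in edges}
    for u, v in combinations(range(n), 2):
        adjacent = frozenset((u, v)) in edge_set
        constrained = (not adjacent) if inverted else adjacent
        if constrained and coloring[u] == coloring[v]:
            return False
    return True


def render(n, edges, presentation="textbook"):
    """Render the same instance in two hypothetical presentations.

    The wording below is invented for illustration; it is not the
    dataset's actual prompt text.
    """
    pairs = ", ".join(f"({u}, {v})" for u, v in edges)
    if presentation == "textbook":
        return (
            f"Color a graph with {n} nodes using as few colors as possible "
            f"so that no two adjacent nodes share a color. Edges: {pairs}."
        )
    if presentation == "party":
        return (
            f"Seat {n} guests at as few tables as possible so that no two "
            f"guests who dislike each other share a table. "
            f"Pairs who dislike each other: {pairs}."
        )
    raise ValueError(f"unknown presentation: {presentation}")


# A path on four nodes: 0 - 1 - 2 - 3
N = 4
EDGES = [(0, 1), (1, 2), (2, 3)]
```

Both rendered prompts describe the same underlying instance, so a solver that reasons robustly should do equally well on either. Under the assumed inverted rule, the coloring `[0, 1, 0, 1]` that is valid for the standard problem becomes invalid, while `[0, 0, 1, 1]` becomes valid.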
Taxonomy
Topics: Scheduling and Optimization Algorithms
