No LLM Solved Yu Tsumura's 554th Problem

Simon Frieder; William Hart

arXiv:2508.03685·cs.LG·August 6, 2025

Open Access · 4 Reviews

TL;DR

This paper shows that, despite recent successes such as IMO gold medals, no off-the-shelf large language model can readily solve Yu Tsumura's 554th problem, highlighting limitations in LLM problem-solving capabilities.

Contribution

The paper identifies a specific IMO-level problem that LLMs cannot solve, challenging assumptions about their problem-solving proficiency.

Findings

01

No existing off-the-shelf LLM (commercial or open-source) can readily solve Yu Tsumura's 554th problem

02

The problem is within the scope of an IMO problem in terms of proof sophistication, yet remains unsolved by LLMs

03

The problem's solution is publicly available and likely in LLM training data

Abstract

We show, contrary to the optimism about LLMs' problem-solving abilities fueled by the recent gold medals that were attained, that a problem exists -- Yu Tsumura's 554th problem -- that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem, a type that has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. This paper identifies an interesting phenomenon where existing SOTA LLMs fail to solve a math problem that has a publicly available solution and is of moderate difficulty.
2. The paper's writing is clear, and its argument is explicit, making the authors' central claim easy to understand.

Weaknesses

1. The paper does not provide many reliable directions for addressing the identified problem, such as improved training data or experiments.
2. The core contribution of the paper relies entirely on a single data point (one problem), which limits the generality of the conclusion about LLM reasoning "brittleness."
3. The authors acknowledge this specific problem will likely be "patched" by models according to Goodhart's Law, which limits the long-term contribution value of this specific finding.

Reviewer 02 · Rating 0 · Confidence 4

Strengths

The topic is interesting. It is always insightful to study and investigate the shortcomings of existing models. This may eventually lead to improving the models.

Weaknesses

The contribution of the paper is very narrow and limited in my view. I think the paper reads more like a blog post than a technical paper ready to be peer reviewed. I am not sure what the authors expect from this review process. If the authors believe they have made a significant contribution, this might imply a lack of familiarity with the literature and the CFP of ICLR. The experiments are weak, focusing only on a single problem. The investigation is insightful, but limited. This investigation could

Reviewer 03 · Rating 0 · Confidence 5

Strengths

N.A.

Weaknesses

- **Extremely narrow empirical scope:** The central claim rests on one hand-picked problem. Even if illustrative, this is a qualitative case study rather than a robust empirical evaluation; conclusions about “systematic failure” risk overgeneralization.
- **Human comparison is anecdotal (n=1):** The IMO participant case study is interesting but not a controlled experiment; it cannot support claims about why humans succeed where LLMs fail, nor establish that the task is broadly “within IMO reach.

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- Yu Tsumura's problem is a nice problem that does satisfy the constraints outlined in the abstract.
- Some counterargument against the buzz around IMO gold is appreciated.

Weaknesses

I have split my concerns into three separate categories: a critical weakness, major weaknesses, and minor weaknesses.

**Critical Weakness**

The contribution of the work is basically non-existent. The authors find a **single** problem that LLMs cannot solve. It is very unsurprising that such a problem can be found, since existing benchmarks also still contain such problems. The analysis is very limited, apart from showing that the models cannot solve the problem. In particular, it is never di

Taxonomy

Topics: Logic, programming, and type systems · Cryptography and Residue Arithmetic · Cryptography and Data Security