Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

Manuel Israel Cazares

arXiv:2604.18897·cs.CL·April 22, 2026

TL;DR

This paper empirically investigates the limits of prompt engineering for formal mathematical reasoning in large language models. It finds that balanced hard accuracy for gpt-oss-120b saturates at roughly 60-79%, against a 59.75% no-cheatsheet baseline, and attributes the ceiling to inherent problem complexity and model constraints.

Contribution

It systematically analyzes more than 40 prompt variants across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B), identifying a single-prompt ceiling on reasoning performance and the mechanisms behind it.

Findings

01

Balanced hard accuracy for gpt-oss-120b plateaus at approximately 60-79% despite substantial prompt engineering effort.

02

Model performance drops significantly with complex rule systems and larger prompts.

03

Prompt ordering can affect model accuracy non-monotonically.

Abstract

We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas, a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60-79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the…
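The abstract's remark that FALSE is decidable via finite model search can be illustrated with a small brute-force sketch. This is not the paper's code: the term encoding (a variable is a string, a product is a pair of terms), the function names, and the `max_size` bound are all assumptions made for illustration. The search enumerates every magma on 1 to `max_size` elements and returns the first one that satisfies the premise law but violates the conclusion law.

```python
from itertools import product

# Hypothetical encoding (not from the paper):
#   a term is a variable name (str) or a pair (left, right) meaning left * right;
#   a law is a pair (lhs, rhs) of terms, read as the equation lhs = rhs.

def evaluate(term, table, env):
    """Evaluate a term in the magma given by `table` under variable assignment `env`."""
    if isinstance(term, str):
        return env[term]
    left, right = term
    return table[evaluate(left, table, env)][evaluate(right, table, env)]

def variables(term):
    """Collect the variable names occurring in a term."""
    if isinstance(term, str):
        return {term}
    return variables(term[0]) | variables(term[1])

def satisfies(law, table):
    """True if the law holds for every assignment of elements to its variables."""
    lhs, rhs = law
    names = sorted(variables(lhs) | variables(rhs))
    for values in product(range(len(table)), repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(lhs, table, env) != evaluate(rhs, table, env):
            return False
    return True

def all_magmas(n):
    """Yield every binary operation table on {0, ..., n-1} (n**(n*n) tables)."""
    for flat in product(range(n), repeat=n * n):
        yield [list(flat[i * n:(i + 1) * n]) for i in range(n)]

def find_counterexample(premise, conclusion, max_size=3):
    """Return a finite magma where premise holds and conclusion fails, else None."""
    for n in range(1, max_size + 1):
        for table in all_magmas(n):
            if satisfies(premise, table) and not satisfies(conclusion, table):
                return table
    return None

if __name__ == "__main__":
    commutative = (("x", "y"), ("y", "x"))                   # x*y = y*x
    associative = ((("x", "y"), "z"), ("x", ("y", "z")))     # (x*y)*z = x*(y*z)
    print(find_counterexample(commutative, associative, max_size=2))
```

A returned table certifies that the implication is FALSE. A result of `None` only means no counterexample exists up to `max_size`; it does not establish that the implication is TRUE, which is where the undecidability mentioned in the abstract enters. The number of tables grows as n^(n^2), so this exhaustive approach is practical only for very small sizes.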

Code & Models

Repositories

israelcazares/sair-prompt-engineering (GitHub)
