Illusions of reflection: open-ended task reveals systematic failures in Large Language Models' reflective reasoning

Sion Weatherhead; Flora Salim; Aaron Belbasis

arXiv:2510.18254 · cs.AI · October 24, 2025

Open Access

TL;DR

This paper critically evaluates whether large language models genuinely perform reflective reasoning by testing their ability to produce and revise scientific test items under real-world constraints, revealing significant limitations in current models.

Contribution

The study introduces an open-ended, rule-constrained task to assess LLMs' reflective reasoning, demonstrating their limited capacity for goal-driven self-correction and the need for external constraint enforcement.

Findings

01

First-pass performance is poor: models often produce zero valid items out of the 4 required (mean ≈ 1).

02

Reflection yields only modest gains (also ≈ 1 item), often due to chance rather than true correction.

03

Performance deteriorates as the task becomes more open-ended, and reasoning-optimized models show no advantage.

Abstract

Humans do not just find mistakes after the fact -- we often catch them mid-stream because 'reflection' is tied to the goal and its constraints. Today's large language models produce reasoning tokens and 'reflective' text, but is this functionally equivalent to human reflective reasoning? Prior work on closed-ended tasks -- with clear, external 'correctness' signals -- can make 'reflection' look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean $\approx$ 1), and reflection yields only modest gains (also $\approx$ 1). Crucially, the second attempt frequently repeats the same…
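The scoring described in the abstract (valid items out of 4 required, before and after reflection) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Item` schema, the rule names, and the `score_reflection` helper are all hypothetical, and the check for rules violated in both attempts is an added illustration, since the truncated abstract does not say what exactly is repeated.

```python
from dataclasses import dataclass

REQUIRED_ITEMS = 4  # the abstract states that 4 items are required per attempt


@dataclass(frozen=True)
class Item:
    """One generated test item plus audited rule violations (hypothetical schema)."""
    text: str
    violations: frozenset  # names of broken rules; empty means the item is valid


def valid_count(items):
    """Count items with no rule violations, capped at the required number."""
    return min(sum(1 for it in items if not it.violations), REQUIRED_ITEMS)


def violated_rules(items):
    """Union of all rule names violated anywhere in one attempt."""
    return set().union(*(it.violations for it in items))


def score_reflection(first, second):
    """Compare a first attempt with the post-reflection attempt (illustrative only)."""
    first_valid = valid_count(first)
    second_valid = valid_count(second)
    return {
        "first_valid": first_valid,
        "second_valid": second_valid,
        "gain": second_valid - first_valid,
        # Assumption: recurrence is measured at the level of rule names.
        "repeated_rules": violated_rules(first) & violated_rules(second),
    }


# Made-up audit results for one model; "rule_1" and "rule_2" are placeholder names.
first = [
    Item("item a", frozenset({"rule_1"})),
    Item("item b", frozenset()),
    Item("item c", frozenset({"rule_2"})),
    Item("item d", frozenset({"rule_1"})),
]
second = [
    Item("item a, revised", frozenset({"rule_1"})),
    Item("item b, revised", frozenset()),
    Item("item c, revised", frozenset()),
    Item("item d, revised", frozenset({"rule_1"})),
]
result = score_reflection(first, second)
```

In this made-up case the gain is one item, yet `rule_1` is still violated after reflection, which is the kind of pattern a per-rule audit would surface that a raw item count would hide.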

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications