Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

Brady Bhalla; Honglu Fan; Nancy Chen; Tony Yue YU

arXiv:2510.18315·cs.LG·October 22, 2025

TL;DR

This study shows that increasing the embedding dimension of a transformer improves the structure, interpretability, and robustness of its internal world model on a simple sorting task, even though high accuracy is already achievable with small dimensions.

Contribution

The paper provides empirical evidence that larger embedding dimensions lead to more faithful and interpretable internal representations in transformer models performing sorting tasks.

Findings

01

Higher embedding dimensions improve internal representation quality.

02

Attention weights encode global token order monotonically.

03

Transpositions align with largest adjacent differences.

Abstract

We investigate how embedding dimension affects the emergence of an internal "world model" in a transformer trained with reinforcement learning to perform bubble-sort-style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves…
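The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the pairwise-agreement definition of a "non-inversion proportion", and the reading of "largest adjacent difference" as the largest signed drop (left value minus right value) are all assumptions made for illustration. The `scores` list stands in for the last row of an attention weight matrix and is assumed to increase with token value.

```python
from itertools import combinations


def non_inversion_proportion(tokens, scores):
    """Fraction of token pairs whose score order matches their value order.

    Assumed definition: a pair (i, j) is "non-inverted" when
    tokens[i] < tokens[j] exactly when scores[i] < scores[j].
    Pairs with equal token values are skipped.
    """
    pairs = [
        (i, j)
        for i, j in combinations(range(len(tokens)), 2)
        if tokens[i] != tokens[j]
    ]
    if not pairs:
        return 1.0
    agree = sum(
        1
        for i, j in pairs
        if (tokens[i] < tokens[j]) == (scores[i] < scores[j])
    )
    return agree / len(pairs)


def largest_adjacent_drop(scores):
    """Index i of the adjacent pair (i, i + 1) with the largest scores[i] - scores[i + 1].

    Assumption: the difference is signed, so the chosen pair is the most
    out-of-order one when scores increase with token value.
    """
    diffs = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return max(range(len(diffs)), key=diffs.__getitem__)


def apply_swap(tokens, i):
    """Return a copy of tokens with positions i and i + 1 exchanged."""
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


# Hypothetical example: scores that order the tokens perfectly.
tokens = [3, 1, 4, 2]
scores = [0.35, 0.05, 0.40, 0.20]
swap_at = largest_adjacent_drop(scores)
next_tokens = apply_swap(tokens, swap_at)
```

Under these assumptions, a model whose encoded values are perfectly monotone in token value scores 1.0 on the first function, and the second function picks an adjacent pair that is out of order whenever one exists.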

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper’s empirical findings are both interesting and somewhat surprising. They show that models trained via gradient descent can naturally converge to interpretable solutions when the model’s capacity is sufficiently large. While this observation is broadly consistent with previous work (e.g., [1]), it is demonstrated here in a new and perhaps more advanced setting.

[1] Yang et al., Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models

Weaknesses

- The wording of this paper could be improved. The terms “world model” and “agent” are not clearly defined, which makes their use vague in a technical context. Typically, these terms are used in more advanced AI systems, such as robotics or large language models, and using them to refer to a transformer trained under reinforcement learning on synthetic tasks may be misleading.
- Additionally, the experimental details are not sufficiently clear. It is difficult to understand the precise inputs a

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The exposition was clear.
- I find it somewhat interesting that the LLMs did indeed seem to record all relevant information about the state in internal representations.

Weaknesses

- There were some omissions in the way the learning task was described.
  - Were the input tokens numeric or text? If they were already numeric, step one of the discovered algorithm could just be the identity function, or any monotonic transform.
- The paper does not offer new evidence as to why increasing the embedding dimension leads to increased fidelity of state space representation, beyond observing that this is true. I would expect more discussion or exploration of why this might be the

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. Enhanced Representation Quality: Larger embedding dimensions strengthen structured internal representations, boosting global order encoding fidelity (non-inversion proportion reaches 87% for length-6 sequences) and sharpening swap decision rules (76–77% top-1 swap alignment). This goes beyond mere accuracy, enabling more robust and consistent world-model formation.
2. Strong Interpretability: The study identifies two clear, consistent mechanisms (global order in attention weights, largest ad

Weaknesses

1. Task Specificity: The study focuses solely on a simple bubble-sort-style adjacent swap task with small sequence lengths (6–8). Results may not generalize to complex algorithmic tasks (e.g., merge sort, graph algorithms) or real-world sequence tasks (e.g., text processing), limiting broader applicability.
2. Embedding Dimension Saturation: Beyond ~30 embedding dimensions, improvements in representation quality (non-inversion proportion, swap alignment) level off. This means excessive embeddi

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning