Verification of the Implicit World Model in a Generative Model via Adversarial Sequences
András Balogh, Márk Jelasity

TL;DR
This paper introduces adversarial sequence generation to verify the soundness of generative models in chess, revealing that most models are not sound and analyzing factors affecting their performance.
Contribution
It presents a novel adversarial verification method for generative sequence models and applies it to chess to analyze model soundness and training effects.
Findings
Most models are not sound in predicting valid chess sequences.
Training techniques and dataset choices influence model soundness.
Board-state probes indicate that the probed board state plays only a limited causal role in predictions.
Abstract
Generative sequence models are typically trained on sample sequences from natural or formal languages. A crucial question is whether, or to what extent, sample-based training can capture the true structure of these languages, often referred to as the "world model". Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that can verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule-based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the…
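The adversarial idea in the abstract can be sketched on a toy world far smaller than chess. The Python sketch below is illustrative only and is not the authors' implementation: the one-piece board, the `windowed_model` stand-in for a greedy-decoded sequence model, both adversaries, and the success-rate definition are hypothetical choices made for this example.

```python
import random
from collections import deque
from typing import Callable, List, Optional

# Hypothetical toy world (not chess): one piece on squares 0..4, starting at 2.
BOARD_SIZE = 5
START = 2

Model = Callable[[List[str]], str]  # maps a move prefix to one predicted move


def legal_moves(pos: int) -> List[str]:
    """Rule-based world model: the piece may not leave the board."""
    moves = []
    if pos > 0:
        moves.append("L")
    if pos < BOARD_SIZE - 1:
        moves.append("R")
    return moves


def apply_move(pos: int, move: str) -> int:
    return pos - 1 if move == "L" else pos + 1


def windowed_model(seq: List[str], window: int = 3) -> str:
    """Assumed flawed model: it estimates the position from the last
    `window` moves only, so it can lose track on longer sequences."""
    est = START + sum(1 if m == "R" else -1 for m in seq[-window:])
    est = max(0, min(BOARD_SIZE - 1, est))
    return "R" if est < BOARD_SIZE - 1 else "L"


def random_adversary(model: Model, rng: random.Random,
                     max_len: int) -> Optional[List[str]]:
    """Play random legal moves; return the first valid prefix after which
    the model predicts an illegal move, or None if none is found."""
    seq: List[str] = []
    pos = START
    while True:
        if model(seq) not in legal_moves(pos):
            return list(seq)
        if len(seq) >= max_len:
            return None
        move = rng.choice(legal_moves(pos))
        seq.append(move)
        pos = apply_move(pos, move)


def exhaustive_adversary(model: Model, max_len: int) -> Optional[List[str]]:
    """Breadth-first search over all valid sequences up to max_len;
    returns a shortest counterexample if one exists."""
    queue = deque([([], START)])
    while queue:
        seq, pos = queue.popleft()
        if model(seq) not in legal_moves(pos):
            return seq
        if len(seq) < max_len:
            for move in legal_moves(pos):
                queue.append((seq + [move], apply_move(pos, move)))
    return None


def success_rate(model: Model, runs: int, max_len: int, seed: int = 0) -> float:
    """Assumed definition: fraction of random-adversary runs that elicit
    an illegal prediction."""
    rng = random.Random(seed)
    hits = sum(random_adversary(model, rng, max_len) is not None
               for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    print("shortest counterexample:", exhaustive_adversary(windowed_model, 6))
    print("random-adversary success rate:",
          success_rate(windowed_model, runs=200, max_len=20))
```

A single counterexample returned by either adversary is enough to show that the toy model is unsound, whereas a `None` result from the bounded search only shows that no violation exists up to `max_len`. The stronger search finds violations that random play may miss, which mirrors the point that attack strength affects what is revealed.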
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper tackles an interesting and clearly defined problem, formulating world-model verification as falsification through adversarial legal sequences. 2. The chosen adversaries (RM, SMM, AD, BSO, IMO) form a meaningful distribution of strength and show how more powerful attacks reveal hidden inconsistencies in the model’s behaviour. 3. The paper is generally clearly written and has a fairly detailed analysis of performance patterns.
1. It is difficult to interpret the results without understanding how competent these GPT-2-based models are as chess players. A comparison to specialised chess language models or an Elo-style strength estimate would help determine whether the observed unsoundness is surprising or simply reflects limited gameplay skill. 2. The strong dependence on sequence length suggests that the models struggle with long-range reasoning. This is an interesting finding, but it mostly reflects the limitations of
- This paper provides a clean, new method of evaluating the ability of a model to recover the underlying rules of its environment. The attack-based evaluation can be performed on any black-box model and is easy to understand. It avoids having to deal with many subtleties of model inference or choosing a background distribution that appear in other works. - The experimental setup evaluates several natural and interesting choices of attack, training objective, and training data. - The paper is
- It is not clear how the success rate is calculated, see the first question. - While easy to implement and generally applicable, notions like success rate of attacks are not very interpretable: they are a loose proxy. Does a model which has attack success rate 40% rather than 50% understand the underlying world model better? It is probably not that informative at that scale as the success rate depends heavily on the specific attack that is chosen. - The setup and experiments are all run with
The paper has many nice strengths: 1) The idea of generating adversarial continuations is clever, and a natural way to assess the implicit world model of generative sequence models. The approach provides an "existence" proof of unsoundness. 2) The experimental scope of the paper is impressive: by my read, 24 models trained across 6 datasets of varying sizes and qualities, evaluated with 5 different adversaries. The systematic experiments enable the authors to make interesting statements
I use this box to describe weaknesses and make comments that I'd like to see the authors address. 1) The authors criticize Vafa et al. (2024) for using an ad hoc probability threshold to define the language generated by the model, and claim that their focus on adversaries avoids this. But there are analogous choices that must be made here. The authors focus exclusively on greedy decoding. Why not other choices? Would they perform differently? 2) The definition of world models as valid
Taxonomy
Topics: Artificial Intelligence in Games · Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics
