Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability

Tao Tao; Maissam Barkeshli

arXiv:2510.26792·cs.LG·February 18, 2026

TL;DR

This paper shows that Transformers can learn to predict, in context, sequences produced by Permuted Congruential Generators (PCGs). Its experiments examine how this ability scales with the modulus, how curricula help training, and what the learned embeddings encode.

Contribution

The paper shows that Transformers can learn PCG sequences at moduli up to $2^{22}$, introduces a curriculum that uses data from smaller moduli to train on larger ones, and offers interpretability insights into the model's internal representations.

Findings

01

Transformers successfully predict PCG sequences in tasks that are beyond published classical attacks.

02

The number of in-context sequence elements needed for near-perfect prediction grows as the square root of the modulus.

03

Embedding analysis reveals that the model groups integer inputs into clusters related by bitwise rotation.
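To make the third finding concrete, the sketch below groups all integers of a fixed bit width into classes whose members are cyclic bit rotations of one another. This is only an illustration of what a rotation-based cluster looks like; the function names are hypothetical and the code is not the paper's embedding analysis.

```python
def rotl(x: int, r: int, bits: int) -> int:
    """Rotate the `bits`-bit integer x left by r positions."""
    r %= bits
    mask = (1 << bits) - 1
    return ((x << r) | (x >> (bits - r))) & mask


def rotation_class(x: int, bits: int) -> int:
    """Canonical representative: the smallest value among all rotations of x."""
    return min(rotl(x, r, bits) for r in range(bits))


def rotation_clusters(bits: int) -> dict[int, list[int]]:
    """Map each canonical representative to the integers that rotate into it."""
    clusters: dict[int, list[int]] = {}
    for x in range(1 << bits):
        clusters.setdefault(rotation_class(x, bits), []).append(x)
    return clusters


if __name__ == "__main__":
    for rep, members in rotation_clusters(4).items():
        print(f"{rep:04b}: {[format(m, '04b') for m in members]}")
```

For 4-bit integers this yields six classes; for example, 0001, 0010, 0100 and 1000 share one class because each is a rotation of the others.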

Abstract

We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly…
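The abstract describes a PCG as an LCG whose hidden state is passed through shifts, XORs, rotations and truncation before being emitted. A minimal toy sketch of that structure follows, assuming an XSH-RR-style output permutation on a 16-bit state with an 8-bit output. The bit widths, shift amounts, constants and function names are illustrative assumptions, not the configurations studied in the paper.

```python
STATE_BITS = 16                     # toy hidden-state width (assumption)
OUT_BITS = 8                        # toy output width (assumption)
ROT_BITS = 3                        # log2(OUT_BITS): bits that select the rotation
STATE_MASK = (1 << STATE_BITS) - 1
OUT_MASK = (1 << OUT_BITS) - 1
A, C = 25173, 13849                 # illustrative LCG constants (A % 4 == 1, C odd)


def rotr(x: int, r: int, bits: int) -> int:
    """Rotate the `bits`-bit integer x right by r positions."""
    r %= bits
    return ((x >> r) | (x << (bits - r))) & ((1 << bits) - 1)


def lcg_step(state: int) -> int:
    """Linear congruential update of the hidden state, modulo 2**STATE_BITS."""
    return (A * state + C) & STATE_MASK


def xsh_rr_output(state: int) -> int:
    """XSH-RR-style permutation: xorshift, truncate, then a state-dependent rotation."""
    rot = state >> (STATE_BITS - ROT_BITS)   # top bits choose the rotation amount
    x = state ^ (state >> 5)                 # xorshift mixes high bits into low bits
    x = (x >> 5) & OUT_MASK                  # truncate to OUT_BITS bits
    return rotr(x, rot, OUT_BITS)


def toy_pcg(seed: int, n: int) -> list[int]:
    """Emit n outputs; only these outputs, never the state, would be shown to a model."""
    state = seed & STATE_MASK
    outputs = []
    for _ in range(n):
        outputs.append(xsh_rr_output(state))
        state = lcg_step(state)
    return outputs


if __name__ == "__main__":
    print(toy_pcg(seed=42, n=16))
```

A predictor sees only the emitted 8-bit values, so it must infer the 16-bit hidden state (or an equivalent representation) from context, which is what makes the task harder than predicting a bare LCG.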

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper is well written. The problem setting and experimental results are clearly presented.
2. I find most of the findings of the paper very interesting (especially the benefits of curriculum learning with the help of a smaller modulus, and the principal component analyses of embedding vectors). Even though the problem scale studied here (e.g., the number of bits required to represent the problem) is considerably smaller than that of PCGs used in practice, I believe it will be a stimulating example for m

Weaknesses

1. The observation that Transformers can learn PCG-generated sequences auto-regressively may not be a very surprising finding. In fact, the studied problem is not that random, since its scale is too small to pass a collection of empirical randomness tests (e.g., BigCrush). Hence, the problem is auto-regressive by definition and can be solved effectively with Transformers to some extent.
2. Indeed, there is already an extensive literature on in-context learning with

Reviewer 02·Rating 4·Confidence 3

Strengths

**S1.** The paper is clearly written, and the figures are well-designed.
**S2.** The authors conduct extensive experiments to support the paper’s claims, and the results convincingly substantiate those claims.

Weaknesses

**W1.** The PCG setup, where a hidden state $s_i$ evolves and the observation $x_i$ is produced via a deterministic function $f$, is conceptually close to HMMs and finite-state automata. To my knowledge, there is already substantial work probing Transformers’ capability on learning HMM/automata-like processes [1,2]. I think the paper should clarify what is genuinely new here versus what might already follow from known results on those finite-state structures.
**W2.** The PCA-based analysis sugg

Reviewer 03·Rating 8·Confidence 3

Strengths

- PCGs are practically relevant and designed to be statistically hard to predict. This work effectively establishes a new, ML-based approach to cryptanalysis and could evolve into a practical benchmark or tool for evaluating the security of other PRNGs. The experiments are thorough and comprehensive.
- The paper provides new insights into the expressivity of Transformers, demonstrating they can model surprisingly complex, non-linear bitwise operations (not just simple arithmetic). The ablatio

Weaknesses

- The data efficiency of the proposed method is not great. It is unclear whether the accuracy mainly comes from memorizing similar patterns.
- (Minor) The interpretability findings are not particularly novel, as previous studies on grokking and probing revealed similar geometric structures. The paper would be even better with a more in-depth mechanistic interpretability analysis.
