The Free Transformer

François Fleuret

arXiv:2510.17558·cs.LG·October 21, 2025

TL;DR

The paper introduces a variational extension of the decoder Transformer that conditions its generation on learned latent variables, leading to significant improvements in downstream task performance.

Contribution

It presents a novel variational approach to condition Transformer decoders on learned latent variables without supervision.

Findings

01. Substantial improvements on downstream tasks.

02. Effective learning of latent variables without supervision.

03. Enhanced generative capabilities of the Transformer.

Abstract

We propose an extension of the decoder Transformer that conditions its generative process on random latent variables which are learned without supervision thanks to a variational procedure. Experimental evaluations show that allowing such a conditioning translates into substantial improvements on downstream tasks.
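The abstract does not spell out the variational procedure. As a rough sketch, assuming a standard conditional-VAE setup, such a procedure typically maximizes an evidence lower bound. The symbols below are assumptions for illustration, not the paper's notation: $x$ is the token sequence, $Z$ the latent variables, $q(Z \mid x)$ the encoder, $p(x \mid Z)$ the decoder, and $p(Z)$ the prior.

$$
\log p(x) \;\ge\; \mathbb{E}_{q(Z \mid x)}\big[\log p(x \mid Z)\big] \;-\; D_{\mathrm{KL}}\big(q(Z \mid x)\,\|\,p(Z)\big)
$$

The first term rewards the decoder for predicting the sequence given $Z$. The KL term limits how much information about $x$ the encoder can pass through $Z$. The paper's exact objective may differ from this generic form.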

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

This paper contributes interesting insights into latent variable models for LLMs and a collection of effective methods for making the latent variables work as random choices about future outputs. Simply adding randomness to hidden representations would not have the same effect, since during LM training these random choices would be independent of future outputs. A central feature of this latent variable model is that there is one latent vector per token, which is novel with respect to the prev

Weaknesses

The paper seems to be a work in progress. There is very little discussion of the empirical results, and no ablation studies other than a standard LLM baseline. The description of the model misses some key points (see below). The experiments don't seem to be testing any hypothesis. The paper reads like they have a cool idea, so let's see what happens. The conclusion is that something interesting happens, but it is not clear what. This impression is reinforced by the lack of any ablation studies

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. Motivation: The core idea is compelling: enabling models to use explicit latent "plans" rather than relying on purely "post-hoc" autoregressive token-level decisions.
2. Efficiency: The design is highly practical, incurring a small training overhead and zero inference-time cost.
3. Novelty: The "KL Governor" is a creative and new approach to managing the notoriously unstable VAE training objective.
4. Empirical Signals: The strong performance gains on complex coding and math reasoning task

Weaknesses

The paper's central claim, while promising, rests on assumptions that would be significantly strengthened by further validation.
1. The Decoder is Not Constrained, and Its Usage of $Z$ is Unproven. The paper's core thesis rests on the decoder using the latent $Z$. However, the proposed "KL Governor" only constrains the encoder to produce an informative $Z$; it does not place any direct constraint on the decoder. This leaves a critical question unanswered: it is unclear if the powerful autoregr

Reviewer 03 · Rating 2 · Confidence 4

Strengths

I really appreciate the author's simple and smart design, which makes minimal changes to the architecture. According to the paper, training and inference time are not much affected by the latent design. The model has a clear and explicit latent variable to control the generation, or perhaps to perform some kind of reasoning in latent space. I think this is a great approach, and "reasoning in the latent space" is a promising direction.

Weaknesses

1. Model-related issue
   * A big autoregressive decoder induces a complex, non-Gaussian posterior, but the paper uses only 2 non-causal layers to output a Gaussian posterior. This is a very narrow variational family and likely leaves an amortization gap.
   * I understand the author wants to solve this posterior collapse issue using a fixed KL. However, this approach must spend the same KL budget for all samples, so easy examples get unnecessary noise and hard examples cannot request more bits. This di

Videos

[Paper Analysis] The Free Transformer (and some Variational Autoencoder stuff) · YouTube

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Stochastic Gradient Optimization Techniques