Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Seijin Kobayashi; Yanick Schimpf; Maximilian Schlegel; Angelika Steger; Maciej Wolczyk; Johannes von Oswald; Nino Scherrer; Kaitlin Maile; Guillaume Lajoie; Blake A. Richards; Rif A. Saurous; James Manyika; Blaise Agüera y Arcas; Alexander Meulemans; João Sacramento

arXiv:2512.20605 · cs.LG · December 25, 2025

Open Access

TL;DR

This paper introduces a hierarchical reinforcement learning approach that acts within the internal representations of an autoregressive model: a higher-order sequence model controls the base model's internal activations, enabling efficient exploration and learning from sparse rewards.

Contribution

It presents a novel higher-order, non-causal sequence model that controls autoregressive model activations, facilitating hierarchical RL and internal reinforcement learning within foundation models.
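To make the idea of controlling activations concrete, here is a minimal toy sketch, not the paper's architecture. It assumes the control signal is simply added to a small residual vector, that each abstract code maps to a fixed control vector, and that a fixed duration stands in for the learned termination. The names `base_step`, `rollout`, `control_table`, and `durations` are hypothetical.

```python
def base_step(hidden, control):
    """One toy update: add a control vector to the residual stream, then read out a token."""
    # Assumption: control enters additively; the paper's mechanism may differ.
    steered = [h + c for h, c in zip(hidden, control)]
    # Toy readout: the emitted token is the index of the largest activation.
    token = max(range(len(steered)), key=lambda i: steered[i])
    # Toy dynamics: decay the stream so the current control dominates over time.
    next_hidden = [0.5 * s for s in steered]
    return next_hidden, token


def rollout(codes, control_table, durations, dim=4):
    """Hold each abstract code for several low-level steps before switching."""
    hidden = [0.0] * dim
    tokens = []
    for code, steps in zip(codes, durations):
        control = control_table[code]
        # Fixed duration is a stand-in for a learned termination signal.
        for _ in range(steps):
            hidden, token = base_step(hidden, control)
            tokens.append(token)
    return tokens


# Two hypothetical abstract codes, each pushing the stream toward one token.
control_table = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 0.0, 1.0, 0.0]}
print(rollout([0, 1], control_table, [3, 3]))  # prints [0, 0, 0, 2, 2, 2]
```

In this sketch two high-level decisions produce six low-level tokens, which is the sense in which a controller acting on activations can operate on a longer timescale than token sampling.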

Findings

01. Higher-order models learn to compress long activation sequences.

02. Controllers execute long-timescale, meaningful actions with learned termination.

03. Internal RL improves learning from sparse rewards where standard RL fails.

Abstract

Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal…
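The inefficiency of token-by-token exploration under sparse rewards can be illustrated with a back-of-the-envelope calculation. This is an illustration under strong assumptions, not an analysis from the paper: exactly one rewarded sequence, uniform random sampling, and abstract actions that tile the solution perfectly. All numbers and the helper `hit_probability` are hypothetical.

```python
from fractions import Fraction


def hit_probability(num_choices, num_decisions):
    """Chance that uniform random sampling produces one specific sequence."""
    return Fraction(1, num_choices ** num_decisions)


horizon = 12       # low-level steps needed to reach the reward (assumed)
num_tokens = 4     # size of the low-level action vocabulary (assumed)
chunk_length = 4   # low-level steps covered by one abstract action (assumed)
num_abstract = 4   # number of available abstract actions (assumed)

# Token-level exploration must get all 12 decisions right.
p_token = hit_probability(num_tokens, horizon)
# Abstract exploration only needs 12 / 4 = 3 correct decisions.
p_abstract = hit_probability(num_abstract, horizon // chunk_length)

print(p_token)               # 1/16777216
print(p_abstract)            # 1/64
print(p_abstract / p_token)  # 262144
```

Under these assumptions, shortening the effective decision horizon from 12 to 3 makes a random rollout 262144 times more likely to reach the reward, which is the intuition behind exploring with temporally abstract actions.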

Taxonomy

Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Neural Networks and Reservoir Computing