INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

Prime Intellect Team; Sami Jaghouar; Justus Mattern; Jack Min Ong; Jannik Straube; Manveer Basra; Aaron Pazdera; Kushal Thaman; Matthew Di Ferrante; Felix Gabriel; Fares Obeid; Kemal Erdem; Michael Keiblinger; Johannes Hagemann

arXiv:2505.07291·cs.LG·May 13, 2025

Open Access · 2 Models · 1 Dataset

TL;DR

INTELLECT-2 is a 32-billion-parameter reasoning model trained with fully asynchronous reinforcement learning across a dynamic, heterogeneous swarm of permissionless compute contributors. The work introduces new infrastructure and training techniques to make this possible.

Contribution

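The paper reports the first globally distributed RL training run of a 32-billion-parameter language model, together with purpose-built components (PRIME-RL, TOPLOC, SHARDCAST) and modifications to the standard GRPO recipe and data filtering that improve training stability and model performance.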

Findings

01. Achieved state-of-the-art reasoning performance in 32B models.

02. Developed and open-sourced a complete decentralized training infrastructure.

03. Demonstrated successful training of a large-scale reasoning model without centralized control.

Abstract

We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, built on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were…
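
The abstract refers to the standard GRPO training recipe without spelling it out. As a rough illustration only, the sketch below shows the group-relative advantage that standard GRPO is generally built around: several completions are sampled for one prompt, and each reward is normalized against the group's mean and standard deviation. It does not reproduce the modifications the paper proposes. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Hypothetical helper for illustration; not taken from PRIME-RL.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    # Population standard deviation is an assumption; implementations differ.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form a baseline. When all samples in a group earn the same reward, every advantage is zero and the group contributes no learning signal.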

Code & Models

Models

Datasets

PrimeIntellect/INTELLECT-2-RL-Dataset
dataset · 108 downloads

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications