PRISM: Demystifying Retention and Interaction in Mid-Training

Bharat Runwal; Ashish Agrawal; Anurag Roy; Rameswar Panda

arXiv:2603.17074·cs.LG·March 25, 2026

TL;DR

This paper introduces PRISM, an empirical study showing that mid-training on approximately 27B high-quality tokens substantially improves the reasoning abilities of large language models, and examining how mid-training data composition and subsequent reinforcement learning (RL) contribute to those gains.

Contribution

PRISM provides a comprehensive analysis of mid-training design choices, shows that mid-training is effective for reasoning tasks, and offers practical guidance for building robust model training pipelines.

Findings

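01

Mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance.

02

The full PRISM to RL pipeline raises the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement).

03

Data composition matters most at the mid-training stage, not at the RL stage.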

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during…
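The macro-average quoted in the abstract is presumably the unweighted mean of per-benchmark scores, and the "3-4x" figure is the ratio of the averages after and before. A minimal sketch of that arithmetic follows; the benchmark names and scores are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from statistics import mean


def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmarks: each benchmark counts equally."""
    if not scores:
        raise ValueError("need at least one benchmark score")
    return mean(list(scores.values()))


def fold_improvement(before: float, after: float) -> float:
    """Ratio of the later score to the earlier one (e.g. 3.0 means 3x)."""
    if before <= 0:
        raise ValueError("baseline must be positive to form a ratio")
    return after / before


# Placeholder numbers for illustration only; NOT taken from the paper.
base_scores = {"bench_a": 4.0, "bench_b": 10.0, "bench_c": 16.0}
pipeline_scores = {"bench_a": 20.0, "bench_b": 30.0, "bench_c": 40.0}

base_avg = macro_average(base_scores)
pipeline_avg = macro_average(pipeline_scores)
ratio = fold_improvement(base_avg, pipeline_avg)

print(f"base={base_avg:.1f} pipeline={pipeline_avg:.1f} ratio={ratio:.1f}x")
```

Because every benchmark carries equal weight, a near-zero score on one hard benchmark pulls the macro-average down as much as a near-zero score on any other, regardless of how many problems each contains.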

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Materials Science