Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes

Junghwan Lim; Sungmin Lee; Dongseok Kim; Taehyun Kim; Eunhwan Park; Jeesoo Lee; Jeongdoo Lee; Junhyeok Lee; Wai Ting Cheung; Dahye Choi; Minsu Ha; Jaeheui Her; Jaeyeon Huh; Hanbin Jung; Changjin Kang; Beomgyu Kim; Minjae Kim; Taewhan Kim; Youngrok Kim; Hyukjin Kweon; Haesol Lee; Kungyu Lee; Dongpin Oh; Yeongjae Park; Bokki Ryu; Dongjoo Weon

arXiv:2512.11463 · cs.AI · December 15, 2025

TL;DR

Motif-2-12.7B-Reasoning is an open-weight language model trained with a recipe spanning system, data, and algorithmic optimizations to improve reasoning and long-context understanding, with performance reported as competitive with larger models.

Contribution

The paper presents a detailed, reproducible training recipe that combines system, data, and algorithmic optimizations for a 12.7B-parameter reasoning-focused language model.

Findings

01 Achieves performance comparable to larger models in reasoning tasks

02 Demonstrates effective training stability and model robustness

03 Provides a practical blueprint for scaling reasoning in language models

Abstract

We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results…

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)