Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

Ling Team; Bin Hu; Cai Chen; Deng Zhao; Ding Liu; Dingnan Jin; Feng Zhu; Hao Dai; Hongzhi Luan; Jia Guo; Jiaming Liu; Jiewei Wu; Jun Mei; Jun Zhou; Junbo Zhao; Junwu Xiong; Kaihong Zhang; Kuan Xu; Lei Liang; Liang Jiang; Liangcheng Fu; Longfei Zheng; Qiang Gao; Qing Cui; Quan Wan; Shaomian Zheng; Shuaicheng Li; Tongkai Yang; Wang Ren; Xiaodong Yan; Xiaopei Wan; Xiaoyun Feng; Xin Zhao; Xinxing Yang; Xinyu Kong; Xuemin Yang; Yang Li; Yingting Wu; Yongkang Liu; Zhankai Xu; Zhenduo Zhang; Zhenglei Zhou; Zhenyu Huang; Zhiqiang Zhang; Zihao Wang; Zujie Wen

arXiv:2506.14731·cs.CL·June 19, 2025

Open Access · 3 Models · 2 Datasets

TL;DR

Ring-lite is a scalable, efficient large language model that combines MoE and reinforcement learning, achieving strong reasoning performance with fewer activated parameters and improved training stability.

Contribution

The paper introduces C3PO, a novel RL training algorithm for MoE models, and a two-stage training paradigm for integrating multi-domain data. Together, these improve training stability and efficiency.

Findings

01 Matches SOTA small-scale reasoning models on benchmarks such as AIME, LiveCodeBench, and GPQA-Diamond

02 Activates only one-third of the parameters required by comparable models

03 Shows improved training stability with C3PO

Abstract

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8-billion-parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves…
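The parameter figures quoted in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the "implied comparable size" rests on the assumption that "one-third" is read literally, since the abstract does not name the models it compares against.

```python
# Figures quoted in the abstract, in billions of parameters.
TOTAL_PARAMS_B = 16.8
ACTIVATED_PARAMS_B = 2.75


def activated_fraction(activated_b: float, total_b: float) -> float:
    """Return the share of the model's parameters used per token."""
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    return activated_b / total_b


def implied_comparable_size(activated_b: float, ratio: float = 1 / 3) -> float:
    """Return the size a comparable model would have if `activated_b`
    is `ratio` of it.

    Assumption: the ratio is taken literally; no baseline is named
    in the abstract.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return activated_b / ratio


fraction = activated_fraction(ACTIVATED_PARAMS_B, TOTAL_PARAMS_B)
comparable_b = implied_comparable_size(ACTIVATED_PARAMS_B)

print(f"activated share of total parameters: {fraction:.1%}")
print(f"implied comparable model size: {comparable_b:.2f}B")
```

Under these assumptions, roughly 16% of Ring-lite's parameters are active per token, and the "one-third" claim would correspond to comparable models of about 8.25 billion parameters.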

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques

Methods: Mixture of Experts