CC-LEARN: Cohort-based Consistency Learning
Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu, Avneet Ahuja, Ming Shen, Zhikun Xu, Ben Zhou

TL;DR
CC-Learn is a reinforcement learning framework that enhances large language models' reasoning consistency by training on cohorts of similar questions, leading to improved accuracy and stability on reasoning benchmarks.
Contribution
Introduces cohort-based consistency learning with a novel composite objective for reinforcement learning to improve LLM reasoning reliability.
Findings
Boosts accuracy on reasoning benchmarks
Enhances reasoning stability and consistency
Outperforms supervised fine-tuning baselines
Abstract
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective that combines cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups. Reinforcement learning can optimize this objective directly, whereas supervised fine-tuning cannot. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT…
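The composite objective described in the abstract can be sketched as a weighted sum. This is only an illustration under stated assumptions: the abstract names the three terms but does not give their functional form, so the linear combination, the weights `w_acc`, `w_ret`, `w_rej`, and the `CohortRollout` fields below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CohortRollout:
    """Hypothetical record of one program evaluated on a whole cohort."""
    correct: Sequence[bool]    # per-question correctness across the cohort
    useful_retrievals: int     # lookups that helped decompose the problem
    rejected_retrievals: int   # lookups judged trivial or invalid


def composite_reward(rollout: CohortRollout,
                     w_acc: float = 1.0,
                     w_ret: float = 0.1,
                     w_rej: float = 0.2) -> float:
    """Assumed linear form: accuracy + retrieval bonus - rejection penalty."""
    if not rollout.correct:
        raise ValueError("a cohort needs at least one question")
    # Cohort accuracy: fraction of cohort members answered correctly.
    cohort_accuracy = sum(rollout.correct) / len(rollout.correct)
    bonus = w_ret * rollout.useful_retrievals
    penalty = w_rej * rollout.rejected_retrievals
    return w_acc * cohort_accuracy + bonus - penalty


example = CohortRollout(correct=[True, True, False, True],
                        useful_retrievals=2,
                        rejected_retrievals=1)
reward = composite_reward(example)
```

Because the reward is computed over the whole cohort instead of a single question, a program that answers one member correctly by luck but fails on the other members scores lower than one that applies the same sound reasoning to every member.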
Peer Reviews
Decision · Submitted to ICLR 2026
- The approach delivers large absolute improvements (10–20+ points) compared to strong baselines, consistent across various datasets, model scales (3B and 7B), and evaluation settings (Lenient and Strict).
- The model also performs strongly on out-of-domain benchmarks, indicating that the reasoning strategies learned during training generalize well to unseen tasks and domains.
1. **High reliance on subjective critique rewards:** CC-LEARN’s performance depends heavily on two critique-based rewards, $R_{fc}$ and $R_{sa}$, both provided by a “Judge” model. $R_{fc}$ is a qualitative score (1–10) reflecting how well the Judge thinks the program covers key factors, an inherently subjective judgment. $R_{sa}$ measures structural similarity to an “improved” program $p^{+}$ that the Judge itself creates, meaning the model is rewarded for resembling the Judge’s output rat
- The paper addresses an important issue in LLM reasoning: the model can potentially be rewarded for incorrect reasoning because the final answer it outputs passes problem-specific tests.
- The method proposed to address this issue is innovative and builds on prior work in deep learning literature in the context of LLM reasoning: ensuring that the solution produced generalises across a cohort of samples instead of a single sample, which makes it much harder for the model to ch
- For most results in Table 1, the 7B model shows similar performance for RL-normal and RL-cohort with execution-based and critique-based rewards (within the confidence intervals). Since the gap is not statistically significant, this suggests that the 7B model's gains over the remaining baselines are due to the reward function design, not cohort-based consistency enforcement.
- Most of the statistically significant gains in Table 1 seem to be for the 3B model, so studyin
- The paper introduces a framework that groups semantically similar questions into cohorts and trains the model to produce a single executable reasoning program shared across them. This design directly targets consistency by forcing uniform reasoning across paraphrased inputs, reducing random output variance and improving stability in reasoning behavior.
- Training is guided by a composite reward that combines execution accuracy, retrieval efficiency, and structural critiques from a frozen judg
- K-of-N scoring rewards group success but may mask problematic individual cases. Absent calibrated confidence or human-in-the-loop triggers, ambiguous items can be mishandled. That limits suitability in high-stakes settings.
- Similar-question cohorts and SFT programs are synthesized by frontier LLMs, then only a small sample is human-checked, which risks template artifacts or label bias leaking into both train and test.
- Figure 2 contains overlapping text.
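The K-of-N concern raised in the review above can be illustrated with a small sketch. It assumes a cohort counts as solved when at least K of its N questions are answered correctly; the excerpt does not state how K and N are set, so `k_of_n_pass`, `failing_indices`, and the example values are hypothetical.

```python
from typing import List, Sequence


def k_of_n_pass(correct: Sequence[bool], k: int) -> bool:
    """Assumed rule: the cohort passes if at least k members are correct."""
    return sum(correct) >= k


def failing_indices(correct: Sequence[bool]) -> List[int]:
    """Positions of the individual questions that the group score hides."""
    return [i for i, ok in enumerate(correct) if not ok]


cohort_results = [True, True, False, True]   # N = 4, one member is wrong
passed = k_of_n_pass(cohort_results, k=3)    # True: 3 of 4 are correct
hidden = failing_indices(cohort_results)     # [2]
```

The cohort passes at K = 3 even though question 2 is answered incorrectly, which is the masking effect the reviewer describes. Reporting the failing indices alongside the group score is one possible way to surface such cases.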
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Context-Aware Activity Recognition Systems
