CC-LEARN: Cohort-based Consistency Learning

Xiao Ye; Shaswat Shrivastava; Zhaonan Li; Jacob Dineen; Shijie Lu; Avneet Ahuja; Ming Shen; Zhikun Xu; Ben Zhou

arXiv:2506.15662·cs.CL·June 19, 2025

Open Access · 3 Reviews

TL;DR

CC-Learn is a reinforcement learning framework that enhances large language models' reasoning consistency by training on cohorts of similar questions, leading to improved accuracy and stability on reasoning benchmarks.

Contribution

Introduces cohort-based consistency learning with a novel composite objective for reinforcement learning to improve LLM reasoning reliability.

Findings

1. Boosts accuracy on reasoning benchmarks
2. Enhances reasoning stability and consistency
3. Outperforms supervised fine-tuning baselines

Abstract

Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT…
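The composite objective described in the abstract can be illustrated with a minimal sketch. The abstract names only the three ingredients (cohort accuracy, a retrieval bonus, a rejection penalty); the linear combination, the weights, the bonus cap, and every identifier below are assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class CohortRollout:
    """Outcome of running one reasoning program over a cohort (hypothetical structure)."""
    correct: list[bool]      # per-question correctness across the cohort
    useful_lookups: int      # retrievals that contributed to the answer (assumed signal)
    rejected_lookups: int    # trivial or invalid retrievals (assumed signal)


def composite_reward(
    rollout: CohortRollout,
    w_acc: float = 1.0,        # assumed weight on cohort accuracy
    w_ret: float = 0.1,        # assumed weight on the retrieval bonus
    w_rej: float = 0.1,        # assumed weight on the rejection penalty
    max_bonus_lookups: int = 3,  # assumed cap so extra lookups are not farmed for reward
) -> float:
    """Combine cohort accuracy, a capped retrieval bonus, and a rejection penalty."""
    if not rollout.correct:
        raise ValueError("cohort must contain at least one question")
    accuracy = sum(rollout.correct) / len(rollout.correct)
    bonus = min(rollout.useful_lookups, max_bonus_lookups) / max_bonus_lookups
    penalty = rollout.rejected_lookups
    return w_acc * accuracy + w_ret * bonus - w_rej * penalty


if __name__ == "__main__":
    rollout = CohortRollout(correct=[True, True, False, True],
                            useful_lookups=2, rejected_lookups=1)
    print(round(composite_reward(rollout), 4))
```

Because the reward is computed from executed outcomes and lookup counts rather than from token-level targets, it is the kind of scalar signal a policy-gradient method can optimize directly, which is the contrast with supervised fine-tuning that the abstract draws.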

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- The approach delivers large absolute improvements (10–20+ points) compared to strong baselines, consistent across various datasets, model scales (3B and 7B), and evaluation settings (Lenient and Strict).
- The model also performs strongly on out-of-domain benchmarks, indicating that the reasoning strategies learned during training generalize well to unseen tasks and domains.

Weaknesses

1. **High reliance on subjective critique rewards:** CC-LEARN’s performance depends heavily on two critique-based rewards ($R_{fc}$ and $R_{sa}$), both provided by a “Judge” model. $R_{fc}$ is a qualitative score (1–10) reflecting how well the Judge thinks the program covers key factors, an inherently subjective judgment. $R_{sa}$ measures structural similarity to an “improved” program $p^{+}$ that the Judge itself creates, meaning the model is rewarded for resembling the Judge’s output rat

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper addresses an important issue in LLM reasoning: the model can be rewarded for incorrect reasoning because the final answer it outputs passes problem-specific tests.
- The method proposed to address this issue is innovative and builds on prior work in deep learning literature in the context of LLM reasoning: ensuring that the solution produced generalises across a cohort of samples instead of a single sample, which makes it much harder for the model to ch

Weaknesses

- For most results in Table 1, the 7B model shows similar performance for RL-normal and RL-cohort with execution-based and critique-based rewards (within the confidence interval). Since the gap is not statistically significant, the 7B model's gains over the remaining baselines appear to come from the reward function design, not from cohort-based consistency enforcement.
- Most of the statistically significant gains in Table 1 seem to be for the 3B model, so studyin

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The paper introduces a framework that groups semantically similar questions into cohorts and trains the model to produce a single executable reasoning program shared across them. This design directly targets consistency by forcing uniform reasoning across paraphrased inputs, reducing random output variance and improving stability in reasoning behavior.
- Training is guided by a composite reward that combines execution accuracy, retrieval efficiency, and structural critiques from a frozen judg

Weaknesses

- K-of-N scoring rewards group success but may mask problematic individual cases. Absent calibrated confidence or human-in-the-loop triggers, ambiguous items can be mishandled. That limits suitability in high-stakes settings.
- Similar-question cohorts and SFT programs are synthesized by frontier LLMs, then only a small sample is human-checked, which risks template artifacts or label bias leaking into both train and test.
- Figure 2 contains overlapping text.
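
The first weakness above, that K-of-N scoring can hide individual failures, can be made concrete with a short sketch. This page does not give the paper's exact scoring rule, so the threshold form below (full credit when at least K of the N cohort members are answered correctly) and both function names are assumptions for illustration only.

```python
def k_of_n_reward(correct: list[bool], k: int) -> float:
    """Assumed K-of-N rule: full reward if at least k cohort members are correct."""
    return 1.0 if sum(correct) >= k else 0.0


def masked_failures(correct: list[bool], k: int) -> list[int]:
    """Indices of wrong answers inside a cohort that still earns the reward."""
    if sum(correct) < k:
        return []  # the cohort fails as a whole, so nothing is hidden
    return [i for i, ok in enumerate(correct) if not ok]


if __name__ == "__main__":
    cohort = [True, True, False, True, True]  # one wrong answer out of five
    print(k_of_n_reward(cohort, k=4))    # the cohort still receives the reward
    print(masked_failures(cohort, k=4))  # the item whose failure goes unpenalized
```

Under this assumed rule, the cohort-level signal is identical whether the third question is answered correctly or not, which is the masking effect the reviewer describes.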

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Context-Aware Activity Recognition Systems