CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

Shuozhe Li; Jincheng Cao; Bodun Hu; Aryan Mokhtari; Leqi Liu; Amy Zhang

arXiv:2602.00085 · cs.LG · February 3, 2026

Open Access · 3 Reviews

TL;DR

CARE-RFT introduces a confidence-anchored regularization technique for reinforcement finetuning of large language models. It balances reasoning ability against trustworthiness and calibration by adaptively penalizing exploration based on confidence.

Contribution

It proposes a novel skew reverse KL divergence regularization method that improves the trade-off between reasoning performance and trustworthiness in RFT.

Findings

01

CARE-RFT matches the reasoning performance of unconstrained RFT.

02

It recovers trustworthiness and calibration of the base model.

03

The method is effective across multiple model scales.

Abstract

Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across…
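
The contrast between an unbounded and a bounded penalty can be illustrated numerically. The sketch below is not taken from the paper: it assumes one common form of the skew reverse KL, KL(π ‖ α·π_ref + (1−α)·π), and the function names and toy distributions are hypothetical. The value α = 0.8 is the setting the reviews report as a good balance. Under this assumed form the penalty is capped at log(1/(1−α)) however small the reference probability becomes, so it shows only the "bounded" half of the behavior the abstract describes; the paper's actual confidence-sensitive regularizer may be defined differently.

```python
import math

def reverse_kl(policy, reference):
    """KL(policy || reference) for two discrete distributions given as lists."""
    return sum(p * math.log(p / r) for p, r in zip(policy, reference) if p > 0)

def skew_reverse_kl(policy, reference, alpha=0.8):
    """Assumed skew form: KL(policy || alpha*reference + (1-alpha)*policy)."""
    mixed = [alpha * r + (1 - alpha) * p for p, r in zip(policy, reference)]
    return reverse_kl(policy, mixed)

# Hypothetical toy case: the finetuned policy is confident in token 0,
# to which the reference model assigns a vanishing probability eps.
policy = [0.98, 0.01, 0.01]
alpha = 0.8
cap = math.log(1.0 / (1.0 - alpha))  # upper bound of the skewed penalty

for eps in (1e-2, 1e-4, 1e-8):
    reference = [eps, (1 - eps) / 2, (1 - eps) / 2]
    rkl = reverse_kl(policy, reference)
    srkl = skew_reverse_kl(policy, reference, alpha)
    print(f"eps={eps:.0e}  RKL={rkl:.3f}  skew RKL={srkl:.3f}  cap={cap:.3f}")
```

As eps shrinks, the plain reverse KL grows without limit while the skewed version stays below the cap, which is the sense in which a skewed penalty leaves room for exploratory deviations from the reference model.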

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper tackles a practically important issue in reinforcement finetuning (RFT) — the trade-off between reasoning performance and trustworthiness. The idea of using a skew reverse KL regularizer to control confidence-dependent updates is conceptually sound and adapts classical divergence theory in a clear, incremental way. Empirically, the study is solid within its scope: it evaluates two model scales (3B, 7B), three representative RFT algorithms (GRPO, DAPO, GSPO), and several reasoning and

Weaknesses

1. Regularizer-strength trade-off is under-explored. The paper provides a clear ablation over the skew parameter α for SRKL (Table 3), showing that α≈0.8 yields a good balance between reasoning and trustworthiness. However, the divergence strength β is fixed at 0.04 for both RKL and CARE-RFT, and its effect is not studied. This leaves open whether CARE’s advantage persists across a broader range of constraint strengths, or whether a well-tuned RKL baseline could close much of the gap in the reaso

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The strength of this paper is that it addresses a problem highly relevant to current research. Recently, reinforcement-learning-style techniques such as RFT, GRPO, and DAPO have been actively studied to enhance the reasoning abilities of LLMs. However, most of these train solely on rewards based on correct/incorrect answers, which has exposed a problem where models become progressively overconfident and lose calibration. CARE-RFT diagnoses the root cause of this

Weaknesses

The paper's main weakness is the lack of mathematical justification linking the proposed regularization term (SRKL) and calibration, specifically regarding its connection to **proper scoring rules**. CARE-RFT claims to mitigate overconfidence via "confidence-anchored regularization." However, it provides no theoretical rationale for whether this regularization actually ensures "properness"—that is, consistency between the predicted probabilities and the true answer distribution. In other words

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- **Potentially impactful method**: KL-based RL training can prevent exploration, while unconstrained training can lead to loss of calibration and hallucinations. The proposed method strikes a balance between the two and can be generally useful.
- **Intuitive writing**: The paper motivates the problem well, and the method is presented in an intuitive way. The paper would be even stronger if intuition is backed with solid theory on why the divergence chosen by the authors is the correct choice.

Weaknesses

- **Limited results**: RL training is performed only on a single dataset (MATH), which introduces doubts about generality. This is further compounded by the fact that the MATH training dataset is small (7000 examples), and actual practice is to use much larger datasets for math training (>20K questions). Very limited ablations and analysis are presented. The authors analyze the entropy curves and find that entropy collapses in unconstrained GRPO, but do not try a method with an entropy loss. -

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks