BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

Junxiao Yang; Jinzhe Tu; Haoran Liu; Xiaoce Wang; Chujie Zheng; Zhexin Zhang; Shiyao Cui; Caishun Chen; Tiantian He; Hongning Wang; Yew-Soon Ong; Minlie Huang

arXiv:2505.13529·cs.AI·February 26, 2026

TL;DR

This paper introduces BARREL, a boundary-aware reasoning framework for large reasoning models that reduces overconfidence and improves factual reliability in their answers.

Contribution

The paper proposes BARREL, a novel training framework that addresses overthinking patterns in LRMs, enhancing their factual reliability without sacrificing accuracy.

Findings

01

Reliability increased from 39.33% to 61.48% in DeepSeek-R1-Distill-Llama-8B.

02

BARREL reduces overconfidence and incorrect answers in LRMs.

03

Models maintain accuracy comparable to reasoning-data fine-tuning.

Abstract

Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results demonstrate…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This paper provides a very clear motivation for fixing two pathological reasoning patterns in current Large Reasoning Models, which is a very important topic.
2. The writing of this paper is fluent and easy to follow.
3. The experiments and analysis in this paper are comprehensive to show the effectiveness of the proposed framework.

All in all, this is a solid paper with clear structure.

Weaknesses

The only weakness I noticed is the definition of the metric "Reliability (Rel.)". Here "ans." is the ratio of answered questions. So why do we need to compute "ans.·Truth." and "(1−ans.)·Acc." and sum them together? What is the rationale behind this, and why do you think this is a more robust metric?
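The sum the reviewer asks about can be written out as a short sketch. The review does not define the component metrics, so the code assumes that Acc. is the fraction of all questions answered correctly, Truth. is the fraction answered correctly or declined, and ans. is the fraction not declined. The function name and the example counts are hypothetical.

```python
def reliability(n_correct: int, n_wrong: int, n_refused: int) -> dict:
    """Compute Rel. = ans. * Truth. + (1 - ans.) * Acc. from response counts.

    Assumed definitions (not stated in the review text):
      Acc.   = correct / total
      Truth. = (correct + refused) / total
      ans.   = (correct + wrong) / total
    """
    total = n_correct + n_wrong + n_refused
    if total == 0:
        raise ValueError("at least one response is required")
    acc = n_correct / total
    truth = (n_correct + n_refused) / total
    ans = (n_correct + n_wrong) / total
    rel = ans * truth + (1 - ans) * acc
    return {"acc": acc, "truth": truth, "ans": ans, "rel": rel}


if __name__ == "__main__":
    # Hypothetical counts out of 100 questions: (correct, wrong, refused).
    for name, counts in [
        ("mixed", (40, 20, 40)),
        ("never refuses", (40, 60, 0)),
        ("always refuses", (0, 0, 100)),
    ]:
        print(name, reliability(*counts))
```

Under these assumed definitions, a model that refuses every question reaches Truth. = 1 but Rel. = 0, and a model that never refuses gets Rel. = Acc. This suggests the weighting is meant to stop blanket refusal from scoring well, though the review text itself does not confirm that rationale.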

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* Clear diagnosis of failure modes with concrete qualitative examples and a simple taxonomy (last-minute guessing vs. second-thought spiraling).
* Conceptually clean training recipe: SFT to seed refusal patterns + GRPO that explicitly rewards calibrated abstention.
* Consistent empirical improvements in reliability/truthfulness across multiple 7–8B backbones, with additional OOD refusal tests and pass@k analysis.
* Ablations that matter: effect of refusal reward magnitude; role of SFT vs. GRP

Weaknesses

1. **Generalizability Issues in Reasoning Trace Construction** The paper states that reasoning traces are constructed by prompting GPT-4 with prompts from Appendix F and BARREL reasoning examples to generate compliant reasoning patterns. However, this generation approach may lead to insufficient diversity in the resulting traces. Additionally, as a non-reasoning-focused model, GPT-4 may not be optimal for generating high-quality Long-CoT-style reasoning processes. A more effective alternative

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is well-motivated. Hallucinations are a major problem with LLMs, and making models aware of their factual boundaries is an important problem. The paper also correctly identifies that the standard paradigm does not reward models for expressing their uncertainty.
- The paper is generally well-written and has a decent flow to it.
- The results are impressive. Models maintain accuracy while also learning to abstain.

Weaknesses

- **Proxy Definition**: The "*knowledge labeling proxy*" (classifying a question as "known to the model") is a weak indicator of model knowledge. It fails to account for uncertainty when multiple plausible answers exist. Say a question has 10 possible answers. The model might not know the correct answer, but has some sampling probability for each of the answers. The proposed framework would classify this question as something known to the model, which seems strange. A majority-vote–based proxy m
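The reviewer's ten-answer scenario can be illustrated with a small sketch. It assumes, from the reviewer's description rather than from the paper's text, that the labeling proxy marks a question as known when at least one sampled answer matches the gold answer. It contrasts this with a majority-vote check of the kind the reviewer mentions. The function names, the exact-string matching, the strict-majority threshold, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def known_by_any_sample(samples: Sequence[str], gold: str) -> bool:
    """Assumed proxy: 'known' if any sampled answer equals the gold answer."""
    return any(s == gold for s in samples)


def known_by_majority_vote(samples: Sequence[str], gold: str) -> bool:
    """Alternative: 'known' only if the gold answer is a strict majority of samples."""
    if not samples:
        return False
    top_answer, count = Counter(samples).most_common(1)[0]
    return top_answer == gold and count > len(samples) / 2


if __name__ == "__main__":
    # Ten samples, each a different candidate answer; exactly one matches gold.
    samples = [f"answer_{i}" for i in range(10)]
    gold = "answer_3"
    print("any-sample proxy:", known_by_any_sample(samples, gold))
    print("majority-vote proxy:", known_by_majority_vote(samples, gold))
```

In this toy case the any-sample check labels the question as known even though the model spreads its samples evenly over ten answers, while the majority-vote check does not.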

Code & Models

Repositories

thu-coai/barrel · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management