BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng, Zhexin Zhang, Shiyao Cui, Caishun Chen, Tiantian He, Hongning Wang, Yew-Soon Ong, Minlie Huang

TL;DR
This paper introduces BARREL, a boundary-aware reasoning framework for large reasoning models that reduces overconfidence and improves factual reliability in their answers.
Contribution
The paper proposes BARREL, a novel training framework that addresses overthinking patterns in LRMs, enhancing their factual reliability without sacrificing accuracy.
Findings
BARREL training raises the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%.
BARREL reduces overconfidence and incorrect answers in LRMs.
Models maintain accuracy comparable to reasoning-data fine-tuning.
Abstract
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL, a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results demonstrate…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper provides a very clear motivation for fixing two pathological reasoning patterns in current Large Reasoning Models, which is a very important topic.
2. The writing of this paper is fluent and easy to follow.
3. The experiments and analysis in this paper are comprehensive and show the effectiveness of the proposed framework.
All in all, this is a solid paper with a clear structure.
The only weakness I noticed is the definition of the metric "Reliability (Rel.)". Here "ans." is the ratio of answered questions. Why do we need to compute "ans.·Truth." and "(1−ans.)·Acc." and sum them? What is the rationale behind this, and why do you think this is a more robust metric?
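To make the formula in this review concrete, here is a minimal Python sketch of Rel. = ans.·Truth. + (1−ans.)·Acc. The review only defines "ans."; the definitions used for Acc. (correct answers over all questions) and Truth. (one minus the share of incorrect answers) are assumptions made for illustration, as are the function and parameter names.

```python
def reliability(n_correct: int, n_incorrect: int, n_refused: int) -> float:
    """Rel. = ans. * Truth. + (1 - ans.) * Acc.

    Assumed definitions (not confirmed by the summary above):
      ans.   = answered questions / all questions
      Acc.   = correct answers / all questions
      Truth. = (correct answers + refusals) / all questions
    """
    total = n_correct + n_incorrect + n_refused
    if total == 0:
        raise ValueError("at least one question is required")
    ans = (n_correct + n_incorrect) / total
    acc = n_correct / total
    truth = (n_correct + n_refused) / total
    return ans * truth + (1 - ans) * acc


# Hypothetical counts: 50 correct, 30 incorrect, 20 refused.
print(round(reliability(50, 30, 20), 4))
```

Under these assumed definitions, a model that always answers gets Rel. equal to Acc., while a model that always refuses gets Rel. of 0 despite a Truth. of 1, which may be the intended reason for weighting the two terms by the answer rate.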
* Clear diagnosis of failure modes with concrete qualitative examples and a simple taxonomy (last-minute guessing vs. second-thought spiraling).
* Conceptually clean training recipe: SFT to seed refusal patterns + GRPO that explicitly rewards calibrated abstention.
* Consistent empirical improvements in reliability/truthfulness across multiple 7–8B backbones, with additional OOD refusal tests and pass@k analysis.
* Ablations that matter: effect of refusal reward magnitude; role of SFT vs. GRP
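The idea of a reward that favors abstention over a wrong answer, combined with GRPO's group-relative normalization, can be sketched as follows. This is an illustrative sketch only: the reward values, the `refusal_reward` parameter, and the outcome labels are hypothetical and are not taken from the paper; only the mean-and-standard-deviation normalization within a sampled group is the standard GRPO formulation.

```python
from statistics import mean, pstdev


def outcome_reward(outcome: str, refusal_reward: float = 0.0) -> float:
    """Three-tier reward: correct > refusal > incorrect (values are assumed)."""
    rewards = {"correct": 1.0, "refuse": refusal_reward, "incorrect": -1.0}
    return rewards[outcome]


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group, as GRPO does."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group of sampled responses to the same question.
outcomes = ["correct", "refuse", "incorrect", "incorrect"]
rewards = [outcome_reward(o, refusal_reward=0.0) for o in outcomes]
print(group_advantages(rewards))
```

Changing `refusal_reward` shifts how strongly a refusal is preferred over a wrong answer within a group, which is the kind of knob the refusal-reward-magnitude ablation mentioned above would vary.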
1. **Generalizability Issues in Reasoning Trace Construction** The paper states that reasoning traces are constructed by prompting GPT-4 with prompts from Appendix F and BARREL reasoning examples to generate compliant reasoning patterns. However, this generation approach may lead to insufficient diversity in the resulting traces. Additionally, as a non-reasoning-focused model, GPT-4 may not be optimal for generating high-quality Long-CoT-style reasoning processes. A more effective alternative
- The paper is well-motivated. Hallucinations are a major problem with LLMs, and making models aware of their factual boundaries is an important problem. The paper also correctly identifies that the standard paradigm does not reward models for expressing their uncertainty.
- The paper is generally well-written and flows well.
- The results are impressive. Models maintain accuracy while also learning to abstain.
- **Proxy Definition**: The "*knowledge labeling proxy*" (classifying a question as "known to the model") is a weak indicator of model knowledge. It fails to account for uncertainty when multiple plausible answers exist. Say a question has 10 possible answers. The model might not know the correct answer, but has some sampling probability for each of the answers. The proposed framework would classify this question as something known to the model, which seems strange. A majority-vote–based proxy m
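The reviewer's concern can be illustrated with a small sketch contrasting two labeling rules over sampled answers. Both functions are hypothetical: the "any sample correct" rule is inferred from the reviewer's description rather than confirmed by the paper, and the majority-vote rule is one possible reading of the alternative the reviewer begins to suggest.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization for exact-match comparison (an assumption)."""
    return answer.strip().lower()


def known_if_any_correct(samples: list[str], gold: str) -> bool:
    """Label a question 'known' if at least one sampled answer matches gold."""
    return any(normalize(s) == normalize(gold) for s in samples)


def known_if_majority_correct(samples: list[str], gold: str) -> bool:
    """Label a question 'known' only if the most frequent answer matches gold."""
    if not samples:
        return False
    top_answer, _ = Counter(normalize(s) for s in samples).most_common(1)[0]
    return top_answer == normalize(gold)


# Hypothetical samples: the model spreads its guesses and hits gold once.
samples = ["Paris", "Lyon", "Lyon", "Nice"]
print(known_if_any_correct(samples, "Paris"))
print(known_if_majority_correct(samples, "Paris"))
```

In this made-up case the first rule labels the question as known because of a single lucky sample, while the majority-vote rule does not, which is the discrepancy the review points at.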
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
