Reasoning's Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

Atoosa Chegini; Hamid Kazemi; Garrett Souza; Maria Safi; Yang Song; Samy Bengio; Sinead Williamson; Mehrdad Farajtabar

arXiv:2510.21049 · cs.CL · October 27, 2025

TL;DR

This study systematically evaluates how reasoning in large language models affects accuracy and recall in safety detection and hallucination detection. It reveals a trade-off: reasoning improves overall accuracy but can reduce recall at the low false positive rates (FPR) that these tasks require in practice.

Contribution

The paper presents the first systematic study of reasoning for classification tasks under strict low-FPR regimes, showing when reasoning helps and when it hurts in safety-critical applications.

Findings

01

Reasoning improves overall accuracy but can reduce recall at low FPR thresholds.

02

Token-based scoring outperforms self-verbalized confidence in precision-sensitive settings.

03

Ensembling reasoning and non-reasoning modes recovers strengths of both approaches.
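
The findings above depend on one metric: recall at a fixed FPR budget. The sketch below shows one way to compute it from per-example scores, along with a simple score-averaging ensemble of the two modes. It is an illustration, not the paper's implementation. The function names `recall_at_fpr` and `ensemble_mean`, the choice of averaging as the ensembling rule, and the toy scores are all assumptions made for this example; the summary does not specify how the authors combine the two modes.

```python
from typing import List, Sequence


def recall_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Best recall (TPR) over all thresholds whose FPR does not exceed max_fpr.

    labels: 1 = positive (e.g. unsafe or hallucinated), 0 = negative.
    An example is predicted positive when its score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    best = 0.0
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


def ensemble_mean(scores_a: Sequence[float], scores_b: Sequence[float]) -> List[float]:
    """Hypothetical ensembling rule: average the two modes' scores per example."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


if __name__ == "__main__":
    # Made-up toy scores, for illustration only (not results from the paper).
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    think_off = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.2, 0.1]
    think_on = [0.9, 0.9, 0.6, 0.6, 0.9, 0.1, 0.1, 0.1]
    combined = ensemble_mean(think_off, think_on)
    for name, s in [("think_off", think_off), ("think_on", think_on), ("ensemble", combined)]:
        print(name, recall_at_fpr(s, labels, max_fpr=0.0))
```

Accuracy is measured at a single decision threshold, whereas recall at a fixed FPR depends on how the scores rank positives against the highest-scoring negatives. The two metrics can therefore disagree about which mode is better.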

Abstract

Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks--safety detection and hallucination detection--evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring…
