High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

Tim Franzmeyer; Archie Sravankumar; Lijuan Liu; Yuning Mao; Rui Hou; Sinong Wang; Jakob N. Foerster; Luke Zettlemoyer; Madian Khabsa

arXiv:2506.04051·cs.CL·February 17, 2026

Open Access · 3 Reviews

TL;DR

HALT is a finetuning method for LLMs that improves reliability by enabling models to abstain from uncertain content, significantly increasing correctness while balancing response completeness across multiple domains.

Contribution

The paper introduces HALT, a novel post-training approach that aligns LLM capabilities with response confidence, reducing hallucinations and improving reliability.

Findings

01. Increases mean correctness of responses by 15% on average.

02. Improves F1 score by 4% compared to baselines.

03. Enhances correctness from 51% to 87% in a large Llama model.

Abstract

Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean…
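The fragment-level editing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Fragment` type, the `build_target` function, and the `mode` names are hypothetical, and the per-fragment correctness labels are assumed to be already available from a ground-truth check. Reading "Unsure from Here" as cutting the response at the first incorrect fragment is also an assumption. The tunable threshold is left out because the abstract does not say how it is applied.

```python
from dataclasses import dataclass
from typing import List

UNSURE_MARKER = "Unsure from Here"  # marker string quoted in the abstract


@dataclass
class Fragment:
    """One atomic statement or reasoning step (hypothetical container)."""
    text: str
    correct: bool  # assumed to come from a ground-truth comparison


def build_target(fragments: List[Fragment], mode: str = "remove") -> str:
    """Build a finetuning target from labeled fragments.

    mode="remove":   drop every incorrect fragment, keep the rest in order.
    mode="truncate": keep fragments up to the first incorrect one, then
                     append the unsure marker (an assumed interpretation).
    """
    if mode == "remove":
        return " ".join(f.text for f in fragments if f.correct)
    if mode == "truncate":
        kept: List[str] = []
        for f in fragments:
            if not f.correct:
                kept.append(UNSURE_MARKER)
                break
            kept.append(f.text)
        return " ".join(kept)
    raise ValueError(f"unknown mode: {mode!r}")


# Example with made-up fragments and labels.
example = [
    Fragment("A.", True),
    Fragment("B.", False),
    Fragment("C.", True),
]
removed = build_target(example, mode="remove")
truncated = build_target(example, mode="truncate")
```

Removal suits fragments that stand alone, while truncation suits fragments that depend on what came before them; which domains use which strategy is not stated in the text above.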

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is generally well-written and clear.
- The motivation is strong; abstention and uncertainty quantification at claim-level are an important approach to decrease LM hallucination while retaining usefulness.
- The method is simple and clever, in particular the idea to use multiple samples given an input along with their number of correct claims to control the correctness-completeness tradeoff. I suspect that this work could emerge as a straightforward, effective baseline for training LM

Weaknesses

- The authors should better contextualize with respect to prior work on finetuning language models to express their verbalized uncertainty [1] [2]. That setting is in some sense more challenging than the final experiment in the present paper, which marks fragments as uncertain instead of omitting them.
- A number of heuristic choices are made which are not clearly ablated. For example, I would expect that many non-mathematical tasks have some causal dependency in consecutive sentences, and there

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The core idea of this paper is intuitive and clearly presented.
2. The "Unsure from here" mechanism makes the model’s uncertainty explicitly interpretable, improving user trust and enabling controllable trade-offs between correctness and completeness.

Weaknesses

1. The method depends on an additional evaluator model to assess fragment correctness, which may introduce bias or inconsistency depending on the evaluator’s quality and alignment.
2. The method also relies on fragmentation, which feels somewhat heuristic to me.
3. The training process is complex, requiring additional time and computational cost.

Reviewer 03 · Rating 10 · Confidence 4

Strengths

1) The method laid out is clearly explained and takes a principled approach. I think the different ways to assess fragments based on the domain are good, and it is good that the authors acknowledge that future work would need to involve dependency graphs.
2) The baselines provided seem to be good to compare their method against.
3) The study in the section "Finetuning on Few-Shot Prompted Responses is comparable to Finetuning on Ground Truth Responses" was important to run to trust this method

Weaknesses

1) One thing that is not clear is how response completeness affects user experience. Obviously we want correct responses for the user, but is a completeness score of 51% low? What's the best tradeoff?
2) There's some lack of discussion. In Figure 4, why is there a sharp dropoff for the MATH dataset but not for Wikibios? I might be missing something, but it would be good to have this clarified.

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education