Hallucination as output-boundary misclassification: a composite abstention architecture for language models
Angelina Hintsanen

TL;DR
This paper introduces a composite abstention architecture combining instruction-based refusal and a structural support deficit gate to reduce hallucinations in large language models.
Contribution
It proposes a composite intervention that combines instruction-based refusal with a structural support deficit gate, and reports that neither mechanism alone was sufficient to control hallucination.
Findings
The composite architecture is reported to achieve high accuracy with low hallucination across the three models evaluated.
Instruction-only prompting sharply reduces hallucination, but it over-abstains on answerable items and leaves residual hallucination for GPT-3.5-turbo.
Structural gating maintains answer accuracy and provides a baseline abstention capability.
Abstract
Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support deficit score, St, from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed…
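The gating rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract does not state how St is computed from At, Pt, and Ct, so the sketch assumes each signal lies in [0, 1] and defines the deficit as one minus their weighted mean. The names `Signals`, `support_deficit`, `gate`, and `ABSTAIN`, the equal weights, and the threshold of 0.5 are all hypothetical.

```python
from dataclasses import dataclass

ABSTAIN = "[abstain: insufficient support]"  # hypothetical sentinel


@dataclass(frozen=True)
class Signals:
    """Black-box support signals, each assumed to lie in [0, 1]."""
    self_consistency: float      # A_t
    paraphrase_stability: float  # P_t
    citation_coverage: float     # C_t


def support_deficit(s: Signals, weights=(1.0, 1.0, 1.0)) -> float:
    """Assumed form of S_t: one minus the weighted mean of the signals.

    The source only says S_t is computed from A_t, P_t, and C_t;
    this particular combination is an illustrative guess.
    """
    values = (s.self_consistency, s.paraphrase_stability, s.citation_coverage)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("signals must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    support = sum(w * v for w, v in zip(weights, values)) / total
    return 1.0 - support


def gate(answer: str, s: Signals, threshold: float = 0.5) -> str:
    """Block the answer when the deficit exceeds the threshold."""
    return ABSTAIN if support_deficit(s) > threshold else answer


if __name__ == "__main__":
    well_supported = Signals(0.9, 0.8, 1.0)
    poorly_supported = Signals(0.2, 0.3, 0.0)
    print(gate("Paris", well_supported))
    print(gate("Atlantis", poorly_supported))
```

Under these assumptions, the composite architecture would apply the refusal instruction at prompt time and then pass the model's answer through `gate`, so that an answer is emitted only if both mechanisms allow it.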
