Annotation-Efficient Universal Honesty Alignment

Shiyu Ni; Keping Bi; Jiafeng Guo; Minghao Tang; Jingtong Wu; Zengxin Han; Xueqi Cheng

arXiv:2510.17509 · cs.CL · March 5, 2026

TL;DR

This paper introduces EliCal, a two-stage, annotation-efficient framework for training large language models to recognize their knowledge limits and express calibrated confidence, using minimal supervision.

Contribution

EliCal combines inexpensive self-consistency supervision with a small set of correctness annotations to achieve universal honesty alignment in LLMs.
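The inexpensive self-consistency signal can be illustrated with a minimal Python sketch. It assumes that a question's confidence is the fraction of sampled answers that agree with the most frequent answer, after a simple lowercase and whitespace normalization. The function names and the normalization rule are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace so that
    # trivially different strings ("Paris " vs "paris") count as the same answer.
    return " ".join(answer.lower().split())


def self_consistency_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Five answers sampled for the same question (toy data).
samples = ["Paris", "paris", "Lyon", "Paris ", "Marseille"]
print(self_consistency_confidence(samples))  # ('paris', 0.6)
```

Because this target needs only repeated sampling and no gold answers, it can be computed for every training question at low cost. That property is what makes it usable as large-scale supervision.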

Findings

01

EliCal achieves near-optimal honesty alignment with only 1k annotations.

02

EliCal outperforms calibration-only baselines on unseen tasks.

03

HonestyBench provides a large-scale benchmark for honesty alignment.

Abstract

Honesty alignment (the ability of large language models (LLMs) to recognize their knowledge boundaries and express calibrated confidence) is essential for trustworthy deployment. Existing methods either rely on training-free confidence estimation (e.g., token probabilities, self-consistency) or on training-based calibration with correctness annotations. While effective, achieving universal honesty alignment with training-based calibration requires costly, large-scale labeling. To support annotation-efficient training, we introduce Elicitation-Then-Calibration (EliCal), a two-stage framework that first elicits internal confidence using inexpensive self-consistency supervision, then calibrates this confidence with a small set of correctness annotations. To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation…
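The calibration stage can be illustrated with generic post-hoc Platt scaling, which maps an elicited confidence score onto a probability of being correct using a small labeled set. This is a swapped-in stand-in, not EliCal's actual training procedure. The sketch fits p(correct) = sigmoid(a * conf + b) by gradient descent on log loss, and the confidences and labels below are hypothetical toy data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(a: float, b: float, conf: list[float], correct: list[int]) -> float:
    """Mean binary cross-entropy of sigmoid(a * conf + b) against 0/1 labels."""
    total = 0.0
    for c, y in zip(conf, correct):
        p = sigmoid(a * c + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(conf)


def fit_platt(conf: list[float], correct: list[int],
              lr: float = 0.5, steps: int = 2000) -> tuple[float, float]:
    """Fit the two Platt parameters (a, b) by full-batch gradient descent."""
    a, b = 1.0, 0.0  # arbitrary starting mapping
    n = len(conf)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for c, y in zip(conf, correct):
            err = sigmoid(a * c + b) - y  # d(log loss)/d(logit)
            grad_a += err * c
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical stage-1 confidences with 0/1 correctness annotations.
conf = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6]
correct = [1, 1, 0, 1, 0, 0]
a, b = fit_platt(conf, correct)
calibrated = [sigmoid(a * c + b) for c in conf]
print(log_loss(1.0, 0.0, conf, correct), log_loss(a, b, conf, correct))
```

A one-dimensional mapping like this only rescales scores and cannot repair a confidence signal that ranks answers poorly. That limitation is one reason a good elicitation stage matters before calibration.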

Code & Models

Datasets

Trustworthy-Information-Access/HonestyBench

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling