Mitigating Label Length Bias in Large Language Models
Mario Sanz-Guerrero, Katharina von der Wense

TL;DR
This paper introduces normalized contextual calibration (NCC), a novel method to reduce label length bias in large language models, significantly improving their accuracy and robustness in multi-token classification tasks.
Contribution
The paper proposes NCC, a new calibration technique that normalizes and calibrates predictions at the full-label level to mitigate label length bias in LLMs.
Findings
NCC improves over prior calibration approaches by up to 10% F1 across multiple datasets and models, and the gains are statistically significant.
NCC reduces sensitivity to few-shot example selection.
NCC enhances confidence estimate reliability.
Abstract
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example…
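The idea of normalizing and calibrating at the full-label level can be illustrated with a minimal sketch. This is not the authors' implementation, and the exact NCC formula may differ. The sketch assumes that each label is scored by its length-normalized probability (the geometric mean of its token probabilities), and that calibration divides this score by the same quantity computed on a content-free baseline input, as in contextual calibration. The function names, the two example labels, and all log-probability values are hypothetical toy data.

```python
import math

def normalized_label_probs(token_logprobs):
    """Length-normalize each label, then renormalize over the label set.

    token_logprobs maps each label to the list of log-probabilities of its
    tokens. The score is exp(mean token log-prob), i.e. the geometric mean
    of the token probabilities.
    """
    scores = {
        label: math.exp(sum(lps) / len(lps))
        for label, lps in token_logprobs.items()
    }
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def calibrate(input_logprobs, baseline_logprobs):
    """Divide normalized label probabilities by those of a baseline input.

    Assumption: the baseline comes from a content-free prompt, so it
    captures the model's prior preference for each full label.
    """
    p_input = normalized_label_probs(input_logprobs)
    p_base = normalized_label_probs(baseline_logprobs)
    ratios = {label: p_input[label] / p_base[label] for label in p_input}
    total = sum(ratios.values())
    return {label: r / total for label, r in ratios.items()}

# Hypothetical toy values: one single-token label and one three-token label.
input_logprobs = {
    "sports": [-1.2],
    "science and technology": [-1.5, -0.2, -0.1],
}
baseline_logprobs = {
    "sports": [-0.7],
    "science and technology": [-2.0, -0.3, -0.2],
}

# Raw sequence probability multiplies token probabilities, which penalizes
# the longer label.
raw = {label: math.exp(sum(lps)) for label, lps in input_logprobs.items()}
raw_prediction = max(raw, key=raw.get)

calibrated = calibrate(input_logprobs, baseline_logprobs)
calibrated_prediction = max(calibrated, key=calibrated.get)

print("raw prediction:", raw_prediction)
print("calibrated prediction:", calibrated_prediction)
```

With these toy numbers the raw sequence probability favors the single-token label, while the normalized and calibrated score favors the three-token label. Working on whole-label scores rather than on the first token alone is what lets the correction account for labels of different lengths.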
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Text and Document Classification Technologies
