Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement
Jinyuan Wang, Ningyuan Deng, Yi Yang

TL;DR
This paper examines miscalibrated confidence scores in LLM-based social science measurement, shows how miscalibration can affect downstream estimates, and proposes a calibration method that improves calibration across multiple models and tasks.
Contribution
It identifies widespread miscalibration in LLM confidence scores for social science tasks and introduces a soft label distillation method that substantially improves calibration.
Findings
Confidence scores are poorly aligned with correctness across models and tasks.
The proposed calibration method reduces ECE by 43.2% and Brier score by 34.0%.
Calibration should be integrated into LLM-based measurement pipelines for validity.
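The two metrics named above, expected calibration error (ECE) and the Brier score, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes each measurement has a confidence in [0, 1] and a binary correctness flag (the paper's tolerance-based correctness is taken here as an already computed 0/1 value), uses equal-width bins, and the function names and sample data are hypothetical.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 correctness flag."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model reports 0.9 confidence but is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 1, 0, 0]
print(brier_score(confidences, correct))
print(expected_calibration_error(confidences, correct))
```

In this toy case the overconfident model is penalized by both metrics; a model whose confidence matched its hit rate in every bin would score an ECE of zero.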
Abstract
Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it also requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, using both proprietary and open-source models, including GPT-5-mini and DeepSeek-V3.2. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple…
