Evaluating language models as risk scores
André F. Cruz, Moritz Hardt, Celestine Mendler-Dünner

TL;DR
This paper assesses how well large language models can serve as risk scores for predicting outcomes with uncertain ground truths, revealing calibration issues and the impact of prompting methods.
Contribution
Introduces folktexts, a software package for systematically generating and evaluating risk scores from LLMs on census data, highlighting calibration challenges.
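As a rough illustration of how a multiple-choice answer could be turned into a risk score, the sketch below renormalizes the probability mass a model places on the two answer tokens. The function name, the answer labels "A" and "B", and the example probabilities are assumptions made for illustration; this is not the folktexts API.

```python
def risk_score_from_token_probs(token_probs, positive_token="A", negative_token="B"):
    """Turn next-token probabilities into a risk score in [0, 1].

    token_probs: dict mapping candidate answer tokens to the probability
    the model assigns them (hypothetical input format).
    """
    p_pos = token_probs.get(positive_token, 0.0)
    p_neg = token_probs.get(negative_token, 0.0)
    total = p_pos + p_neg
    if total <= 0:
        raise ValueError("model assigned no probability to either answer token")
    # Renormalize so that mass placed on unrelated tokens is ignored.
    return p_pos / total


# Hypothetical model output: 60% on "A" (positive outcome), 20% on "B".
example_probs = {"A": 0.6, "B": 0.2, "C": 0.2}
score = risk_score_from_token_probs(example_probs)
```

Keeping the full score, rather than only the most likely answer, is what makes it possible to assess calibration and not just accuracy.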
Findings
Zero-shot multiple-choice risk scores have high predictive signal but poor calibration.
Instruction-tuned models tend to underestimate uncertainty and are over-confident.
Chat-style queries improve calibration of risk scores.
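To make "poor calibration" and "over-confident" concrete, here is a minimal sketch of expected calibration error (ECE) with equal-width bins, one common way to measure the gap between predicted risk and observed outcome rates. This summary does not state which calibration metric the paper reports, so the choice of ECE, the function name, and the toy data are all assumptions.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted average gap between mean score and positive rate per bin."""
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        # Scores of exactly 1.0 fall into the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    n = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_score - positive_rate)
    return ece


# Toy data: half of the outcomes are positive in both cases.
labels = [1, 1, 0, 0]
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], labels)
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], labels)
```

In the toy data, scores of 0.95 against a 50% positive rate give a large error, while scores of 0.5 give none, even though neither set of scores separates positives from negatives.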
Abstract
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely…
Taxonomy
Topics: Resilience and Mental Health · Topic Modeling
Methods: Focus · Balanced Selection
