Evaluating language models as risk scores

André F. Cruz; Moritz Hardt; Celestine Mendler-Dünner

arXiv:2407.14614·cs.LG·December 2, 2024

TL;DR

This paper assesses how well large language models serve as risk scores on prediction tasks whose outcomes are inherently uncertain. It finds that the scores are often miscalibrated and that the prompting method affects calibration.

Contribution

Introduces folktexts, a software package for systematically generating and evaluating risk scores from LLMs on census data, highlighting calibration challenges.

Findings

1. Zero-shot multiple-choice risk scores have high predictive signal but poor calibration.

2. Instruction-tuned models tend to underestimate outcome uncertainty, which makes their risk scores over-confident.

3. Chat-style queries improve calibration of risk scores.
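The difference between predictive signal and calibration in the findings above can be illustrated with a small sketch. It uses synthetic data only, not the paper's results or the folktexts API; the helper names (`auc`, `expected_calibration_error`, `sharpen`) and the sharpening exponent are assumptions chosen for illustration. Sharpening a calibrated score with a monotone transform leaves the ranking (AUC) essentially unchanged while pushing scores toward 0 and 1, which is one way an over-confident score can look.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(labels, scores, n_bins=10):
    """Weighted mean gap between average score and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # score 1.0 goes in the last bin
        bins[idx].append((y, s))
    ece = 0.0
    for members in bins:
        if members:
            mean_label = sum(y for y, _ in members) / len(members)
            mean_score = sum(s for _, s in members) / len(members)
            ece += len(members) / len(labels) * abs(mean_score - mean_label)
    return ece

def sharpen(p, k=5.0):
    """Hypothetical over-confidence model: a monotone map pushing p toward 0 or 1."""
    return p**k / (p**k + (1 - p) ** k)

# Synthetic, unrealizable task: each outcome is a coin flip with its own bias.
rng = random.Random(0)
true_risk = [rng.uniform(0.05, 0.95) for _ in range(2000)]
labels = [1 if rng.random() < p else 0 for p in true_risk]

calibrated = true_risk
overconfident = [sharpen(p) for p in true_risk]

auc_cal, auc_over = auc(labels, calibrated), auc(labels, overconfident)
ece_cal = expected_calibration_error(labels, calibrated)
ece_over = expected_calibration_error(labels, overconfident)
print(f"AUC  calibrated={auc_cal:.3f}  over-confident={auc_over:.3f}")
print(f"ECE  calibrated={ece_cal:.3f}  over-confident={ece_over:.3f}")
```

Both scores rank individuals almost identically, so an accuracy- or AUC-only evaluation would not separate them; only a calibration metric does.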

Abstract

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely…
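As a rough illustration of how a multiple-choice answer can be read as a risk score, the sketch below normalizes the probabilities of two answer tokens into a single probability for the positive class. This is a minimal sketch under stated assumptions, not the folktexts implementation: the function name `risk_score_from_logprobs` and the two-option setup are hypothetical, and the log-probabilities are made-up inputs rather than output from a real model.

```python
import math

def risk_score_from_logprobs(logprob_yes: float, logprob_no: float) -> float:
    """Normalize two answer-token log-probabilities into P(positive answer).

    Assumption: the model's next-token distribution gives a log-probability
    for each of the two answer options; mass on other tokens is discarded.
    """
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)

# Made-up example: the model puts 0.30 on the positive option and 0.10 on
# the negative option, with the remaining mass on unrelated tokens.
score = risk_score_from_logprobs(math.log(0.30), math.log(0.10))
print(round(score, 2))  # 0.75
```

The resulting number can then be compared against observed outcome rates, rather than only checking whether the most likely token matches the label.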

Code & Models

Repositories

socialfoundations/folktexts
Official

Datasets

acruz/folktexts
dataset · 78 downloads

Videos

Evaluating language models as risk scores · SlidesLive

Taxonomy

Topics: Resilience and Mental Health · Topic Modeling

Methods: Focus · Balanced Selection