ElicitationGPT: Text Elicitation Mechanisms via Language Models

Yifan Wu; Jason Hartline

arXiv:2406.09363·cs.AI·November 13, 2025

TL;DR

This paper introduces ElicitationGPT, a mechanism that uses a large language model to score elicited text against ground-truth text. The scoring is provably proper, and its alignment with human preferences is evaluated empirically.

Contribution

The paper develops a provably proper scoring mechanism for text elicitation that relies only on domain-knowledge-free queries to a black-box language model, and evaluates it empirically.

Findings

01. The mechanism's scores align well with human preferences when evaluating peer reviews.

02. Provable properness is achieved using black-box language models.

03. Empirical results on real-world peer grading demonstrate the mechanism's effectiveness.

Abstract

Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence…
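To make the notion of a proper scoring rule concrete, here is a minimal Python sketch of the standard quadratic scoring rule for a forecast over a finite set of states. It is not the paper's mechanism: the function names `quadratic_score` and `expected_score` and the example beliefs are illustrative assumptions, and the reduction from text to forecasts via a language model is not shown. The sketch only checks that, under this rule, reporting one's true belief yields a higher expected score than an exaggerated or a hedged report.

```python
from typing import Sequence


def quadratic_score(forecast: Sequence[float], realized: int) -> float:
    """Quadratic scoring rule: 2 * p[realized] - sum_i p_i^2 (higher is better)."""
    return 2.0 * forecast[realized] - sum(p * p for p in forecast)


def expected_score(belief: Sequence[float], report: Sequence[float]) -> float:
    """Expected score of `report` when the state is drawn from `belief`."""
    return sum(b * quadratic_score(report, state) for state, b in enumerate(belief))


# Hypothetical forecaster who believes state 0 occurs with probability 0.7.
belief = [0.7, 0.3]

truthful = expected_score(belief, belief)
exaggerated = expected_score(belief, [0.9, 0.1])
hedged = expected_score(belief, [0.5, 0.5])

# Properness: the truthful report maximizes the expected score.
print(truthful > exaggerated and truthful > hedged)
```

A scoring rule with this property gives the forecaster no incentive to misreport, which is the guarantee the abstract says carries over to text through the reduction.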

Taxonomy

Topics: Topic Modeling