ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?
Romain Lacombe, Kerrie Wu, Eddie Dilworth

TL;DR
This paper introduces ClimateX, a dataset of climate statements with expert confidence labels, and evaluates how well LLMs can assess human expert confidence. Accuracy is limited (up to 47%), and models are consistently over-confident on low and medium confidence statements.
Contribution
The paper presents ClimateX, a new expert-labeled dataset for climate statement confidence, and evaluates LLMs' performance in classifying expert confidence levels.
Findings
LLMs achieve up to 47% accuracy in classifying confidence levels.
Models tend to be over-confident on low and medium confidence statements.
Few-shot learning improves classification performance.
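The two headline measures above, classification accuracy and over-confidence by confidence level, can be illustrated with a minimal sketch. The ordinal label set, the `evaluate` helper, and the toy data are assumptions made for illustration; they are not the paper's evaluation code or results. Over-confidence is approximated here as the mean signed difference between predicted and true rank, so a positive value means the model rated statements as more certain than the experts did.

```python
from collections import defaultdict

# Assumed ordinal confidence scale (illustrative; check the dataset for its actual labels).
LEVELS = ["low", "medium", "high", "very high"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, mean signed rank error per true confidence level)."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    correct = 0
    bias_sum = defaultdict(int)  # sum of (predicted rank - true rank) per true level
    counts = defaultdict(int)    # number of statements per true level

    for truth, pred in zip(true_labels, predicted_labels):
        correct += int(truth == pred)
        bias_sum[truth] += RANK[pred] - RANK[truth]
        counts[truth] += 1

    accuracy = correct / len(true_labels)
    # Positive bias for a level means the model tends to predict higher confidence.
    bias = {lvl: bias_sum[lvl] / counts[lvl] for lvl in LEVELS if counts[lvl]}
    return accuracy, bias


# Toy data, invented for this example only.
true_labels = ["low", "low", "medium", "high", "very high"]
predicted = ["medium", "high", "high", "high", "very high"]
accuracy, bias = evaluate(true_labels, predicted)
```

On this toy input, the low and medium levels show positive bias while the high levels show none, which mirrors the kind of pattern the findings describe.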
Abstract
Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLM evaluation strategies, and the use of LLMs in information retrieval systems.
Taxonomy
Topics: Climate Change Communication and Perception · Computational and Text Analysis Methods · Expert finding and Q&A systems
