CLINB: A Climate Intelligence Benchmark for Foundational Models
Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, Zeke Hausfather, Elisa Rawat, Erich Fischer, Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti

TL;DR
CLINB introduces a comprehensive benchmark for evaluating large language models' ability to handle complex, multimodal climate change questions, highlighting strengths in knowledge synthesis but challenges in evidence grounding and attribution.
Contribution
This paper presents CLINB, a novel benchmark with real user questions and expert-curated rubrics, to assess LLMs on grounded, multimodal climate knowledge tasks.
Findings
Frontier models show PhD-level understanding and quality in answers.
Models outperform hybrid expert-assisted answers in synthesis.
Models hallucinate references and images at significant rates.
Abstract
Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users' questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform "hybrid" answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The…
Peer Reviews
Decision · Submitted to ICLR 2026
The benchmark includes complex questions that demand research, assessment of evidence, and careful synthesis, and it expects long-form answers that may include visuals and must be grounded with references and citations. The evaluation is strengthened by feedback from domain experts at each stage, with detailed grading rubrics that are created and verified, and scientists who contribute to data creation and check the scientific validity of the results.
(1) The paper evaluates references and visuals by checking whether links exist, not whether the cited content is correct, relevant, or high quality. This can make an answer appear grounded even when the source does not support the claim or the visual is mismatched. (2) The hybrid answer generation uses a weaker base model, and the rationale for this choice is not fully explained. The hybrids that use stronger base models, or at least a clear standalone baseline for the weaker model in the same
* This paper introduces the first expert-validated, open-ended, multimodal benchmark focused on climate intelligence, moving beyond traditional multiple-choice or trivia-style tests.
* Unlike traditional multiple-choice or trivia-style evaluations, CLINB focuses on realistic, evidence-based scientific reasoning. The benchmark is built through collaboration among scientists, domain experts, and informed non-experts, which ensures both diversity and data quality. I especially appreciate the emph
1. While the authors argue that this is the first work on climate evaluation, some relevant datasets already exist [1] [2]. The authors do not discuss these related works or compare the differences.
2. This paper claims that the models have reached PhD-level capability, yet they still exhibit significant hallucinations in grounding. The two points are contradictory, which makes the argument unconvincing.
3. There is no methodological novelty. Even though the zero-shot prompting is
I think the primary strength of this paper is its rigorous methodology. It moves beyond simple, closed-form benchmarks to tackle the much harder problem of evaluating open-ended, specialized scientific communication. The three-phase data creation process, involving multiple tiers of human experts and scientists, is thorough. It directly measures a critical and often-overlooked failure point in LLMs: the gap between fluent synthesis and verifiable grounding. The validation of its "autorater" agai
However, I think the main weakness is the scalability of its own methodology. The creation of the expert-driven, question-specific rubrics is resource-intensive, and the paper notes that these rubrics may not generalize well to future models. This raises questions about how CLINB can be sustainably updated. This is my major concern, especially for a dataset paper. The paper's conclusion that "familiarity bias" caused the disagreement between rater groups (Experts vs. Scientists) is
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Multimodal Machine Learning Applications
