ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures
Tobias Schimanski, Jingwei Ni, Roberto Spacey, Nicola Ranger, Markus Leippold

TL;DR
This paper introduces ClimRetrieve, a new dataset for evaluating domain-specific information retrieval from corporate climate disclosures, addressing a key gap in climate communication analysis.
Contribution
It creates a benchmark dataset with labeled climate-related questions and explores integrating expert knowledge into retrieval systems, highlighting current limitations.
Findings
Embedding-based retrieval with expert knowledge improves accuracy
Significant challenges remain in knowledge-intensive climate domains
Dataset enables targeted evaluation of climate information retrieval methods
Abstract
To handle the vast amounts of qualitative data produced in corporate climate communication, stakeholders increasingly rely on Retrieval Augmented Generation (RAG) systems. However, a significant gap remains in evaluating domain-specific information retrieval - the basis for answer generation. To address this challenge, this work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions. As a result, we obtain a dataset with over 8.5K unique question-source-answer pairs labeled by different levels of relevance. Furthermore, we develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings. Although we show that incorporating expert knowledge works, we also outline the critical limitations of embeddings in knowledge-intensive downstream domains…
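To make the idea of integrating expert knowledge into embedding-based retrieval concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible integration, appending an expert note to the question before embedding it, and it swaps a real embedding model for a toy bag-of-words vector so the example runs without external dependencies. The function names (`embed`, `cosine`, `retrieve`), the report paragraphs, the question, and the expert note are all hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, paragraphs: list[str],
             expert_note: str = "", k: int = 1) -> list[str]:
    """Rank paragraphs by similarity to the question.

    If an expert note is given, it is appended to the question before
    embedding (an assumed, deliberately simple integration strategy).
    """
    query = embed(f"{question} {expert_note}".strip())
    ranked = sorted(paragraphs, key=lambda p: cosine(query, embed(p)),
                    reverse=True)
    return ranked[:k]


# Hypothetical report paragraphs and question, invented for illustration.
paragraphs = [
    "Our company discloses its approach to governance and board oversight.",
    "Flooding and heat waves could damage our production sites.",
]
question = "Does the company report physical climate risks?"
expert_note = "Physical risks include flooding, heat waves, and storms."

# Without the note, only surface word overlap ("company") drives the ranking.
print(retrieve(question, paragraphs))
# With the note, the paragraph about flooding and heat waves ranks first.
print(retrieve(question, paragraphs, expert_note))
```

In this toy setup the question alone shares no vocabulary with the paragraph that actually describes physical risks, so the expert note is what connects the two. A real embedding model captures more than word overlap, but the same mechanism of enriching the query with domain definitions applies.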
Taxonomy
Topics: Impact of AI and Big Data on Business and Society
