SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting

Mohammed Ali; Abdelrahman Abdallah; Adam Jatowt

arXiv:2508.03000·cs.IR·October 10, 2025

TL;DR

SustainableQA is a large, high-quality dataset of over 195,000 QA pairs from corporate sustainability reports, designed to improve question-answering systems for EU taxonomy compliance and sustainability transparency.

Contribution

We introduce SustainableQA, a scalable pipeline for generating and validating a comprehensive sustainability QA dataset from corporate reports, enhancing domain-specific NLP applications.

Findings

01. Fine-tuned models outperform larger state-of-the-art models on sustainability QA tasks.

02. The dataset enables effective training of knowledge assistants for complex sustainability data.

03. Automated validation improves data quality and relevance.

Abstract

The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports, a task for which Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems require high-quality, domain-specific question-answering (QA) datasets. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, the generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. This results in a final, robust…
