SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting
Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt

TL;DR
SustainableQA is a large, high-quality dataset of over 195,000 QA pairs from corporate sustainability reports, designed to improve question-answering systems for EU taxonomy compliance and sustainability transparency.
Contribution
We introduce SustainableQA, a dataset and a scalable pipeline for generating and validating comprehensive sustainability QA pairs from corporate reports, enhancing domain-specific NLP applications.
Findings
Fine-tuned models outperform larger state-of-the-art models on sustainability QA tasks.
The dataset enables effective training of knowledge assistants for complex sustainability data.
Automated validation improves data quality and relevance.
Abstract
The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. For this task, Large Language Models and Retrieval-Augmented Generation (RAG) systems require high-quality, domain-specific question-answering datasets. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline that generates comprehensive QA pairs from corporate sustainability and annual reports by integrating semantic chunk classification, a hybrid span extraction pipeline, and a specialized table-to-paragraph transformation. To ensure high quality, the generation is followed by a novel automated assessment and refinement pipeline that systematically validates each QA pair for faithfulness and relevance, repairing or discarding low-quality entries. This results in a final, robust…
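The validate, repair, or discard step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how faithfulness and relevance are scored, so the checks below are simple string heuristics chosen purely for illustration. The names `QAPair`, `is_faithful`, `is_relevant`, `repair`, and `refine` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAPair:
    question: str
    answer: str
    context: str  # the report passage the pair was generated from


def is_faithful(pair: QAPair) -> bool:
    # Assumption: treat an answer as faithful if it appears verbatim in the context.
    return pair.answer.strip().lower() in pair.context.lower()


def is_relevant(pair: QAPair) -> bool:
    # Assumption: treat a question as relevant if it shares at least one
    # word longer than three characters with the context.
    q_words = {w.strip("?.,").lower() for w in pair.question.split()}
    c_words = {w.strip("?.,").lower() for w in pair.context.split()}
    return any(len(w) > 3 and w in c_words for w in q_words)


def repair(pair: QAPair) -> Optional[QAPair]:
    # Hypothetical repair: strip stray whitespace and punctuation from the
    # answer, then re-run both checks. Returns None if the pair still fails.
    fixed = QAPair(pair.question, pair.answer.strip(" .;,"), pair.context)
    return fixed if is_faithful(fixed) and is_relevant(fixed) else None


def refine(pairs: List[QAPair]) -> List[QAPair]:
    # Keep pairs that pass, try to repair those that fail, discard the rest.
    kept = []
    for pair in pairs:
        if is_faithful(pair) and is_relevant(pair):
            kept.append(pair)
            continue
        repaired = repair(pair)
        if repaired is not None:
            kept.append(repaired)
    return kept
```

In a real pipeline the two predicate functions would presumably be replaced by stronger validators; the control flow (pass, repair, or discard) is the only part this sketch is meant to show.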
