CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

Kaiwen Zhao; Bharathan Balaji; Stephen Lee

arXiv:2508.03489 · cs.CL · August 6, 2025

TL;DR

This paper introduces CarbonPDF, a fine-tuned Llama 3-based method, and the CarbonPDF-QA dataset for improving question-answering on unstructured PDF sustainability reports about carbon footprints.

Contribution

The paper contributes a new dataset and a specialized LLM-based approach for answering carbon footprint questions over the unstructured, inconsistent text extracted from PDF reports.

Findings

01. CarbonPDF outperforms existing QA systems.

02. GPT-4o struggles with data inconsistencies.

03. The dataset enables better model training.

Abstract

Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with…
