An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact
Avanija Menon, Ovidiu Serban

TL;DR
This paper introduces an automated NLP pipeline that uses LLMs and retrieval-augmented validation to build detailed asset-level databases for assessing deforestation impacts, improving accuracy and reliability over traditional methods.
Contribution
The study develops a novel end-to-end LLM-based data extraction pipeline that combines IRZ-CoT prompting with RAV, tailored for environmental impact assessment in high-risk sectors.
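As a rough illustration of what an Instructional, Role-Based, Zero-Shot Chain-of-Thought prompt could look like, the sketch below assembles the three ingredients the name implies: a role, an explicit instruction, and a reasoning trigger with no worked examples. The class name `IRZCoTPrompt`, the field names, and the wording of each part are assumptions made for illustration; the paper's actual prompt text is not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class IRZCoTPrompt:
    """Hypothetical container for the parts of an IRZ-CoT style prompt."""

    role: str          # role-based part: who the model should act as
    instruction: str   # instructional part: what to extract and how to format it
    # Zero-shot CoT part: a generic reasoning trigger, with no in-context examples.
    cot_trigger: str = "Let's think step by step."

    def render(self, passage: str) -> str:
        """Join role, instruction, source text, and trigger into one prompt string."""
        return "\n\n".join(
            [
                f"Role: {self.role}",
                f"Instruction: {self.instruction}",
                f"Text:\n{passage.strip()}",
                self.cot_trigger,
            ]
        )


if __name__ == "__main__":
    prompt = IRZCoTPrompt(
        role="You are an analyst reading corporate filings.",
        instruction="List every physical asset mentioned, with its location.",
    )
    print(prompt.render("The company operates a facility in Exampleland."))
```

Keeping the three parts as separate fields makes it straightforward to compare against a plain zero-shot baseline by dropping the role or the trigger.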
Findings
Significant accuracy improvements over zero-shot prompting.
Enhanced data validation through real-time web searches.
Effective application to SEC filings in multiple sectors.
Abstract
The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining,…
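The Retrieval-Augmented Validation idea described above can be sketched as a check of each extracted record against text retrieved from a search. This is a minimal sketch under stated assumptions, not the authors' implementation: the `search` callable is a hypothetical stand-in for a real-time web search, the field names `asset_name` and `location` are invented for illustration, and simple substring matching replaces whatever judgment step the paper actually uses.

```python
from typing import Callable, Dict, Iterable, Sequence


def validate_record(
    record: Dict[str, str],
    search: Callable[[str], Iterable[str]],
    fields: Sequence[str] = ("asset_name", "location"),
) -> Dict[str, object]:
    """Mark a record as validated if every chosen field appears in a retrieved snippet."""
    # Build one query from the fields being checked (hypothetical query strategy).
    query = " ".join(record[f] for f in fields)
    snippets = [s.lower() for s in search(query)]

    # A field counts as supported when its value occurs in at least one snippet.
    supported = {
        f: any(record[f].lower() in snippet for snippet in snippets)
        for f in fields
    }
    return {
        "record": record,
        "supported": supported,
        "validated": all(supported.values()),
    }


if __name__ == "__main__":
    # Stub search returning canned text; a real pipeline would query the web here.
    def fake_search(query: str) -> list:
        return ["Example Mine is an open-pit site located in Exampleland."]

    result = validate_record(
        {"asset_name": "Example Mine", "location": "Exampleland"}, fake_search
    )
    print(result["validated"])
```

Passing the search function in as a parameter keeps the validation logic testable offline and lets the retrieval backend be swapped without touching the check itself.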
Taxonomy
Topics: Regulation and Compliance Studies · Environmental and Social Impact Assessments · Corporate Social Responsibility Reporting
