Information Extraction From Fiscal Documents Using LLMs

Vikram Aggarwal; Jay Kulkarni; Aditi Mascarenhas; Aakriti Narang; Siddarth Raman; Ajay Shah; Susan Thomas

arXiv:2511.10659·cs.CL·November 25, 2025

TL;DR

This paper introduces a novel LLM-based method for extracting and validating structured data from complex, multi-page fiscal documents, demonstrating high accuracy and robustness in hierarchical data processing.

Contribution

It presents a multi-stage pipeline leveraging LLMs, domain knowledge, and hierarchical validation to extract structured fiscal data from lengthy government documents.

Findings

01. High accuracy in data extraction from multi-page fiscal documents

02. Effective validation using hierarchical relationships within fiscal tables

03. Scalable approach applicable to developing country contexts

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A major challenge with traditional OCR methods is that they offer no way to verify that numbers were extracted accurately. In fiscal data, however, the inherent structure of the tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level…
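The validation idea in the abstract, checking extracted numbers against the totals reported at each level of the hierarchy, can be illustrated with a minimal sketch. This is not the authors' implementation, which the excerpt does not describe; the `Node` class, the `find_mismatches` function, the tolerance, and the budget labels and figures are all hypothetical, chosen only to show how a parent total that disagrees with the sum of its children flags a likely extraction error.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One line item in a fiscal table (hypothetical structure)."""
    label: str
    reported: float                      # value as extracted from the document
    children: list["Node"] = field(default_factory=list)


def find_mismatches(node, tol=0.01, path=""):
    """Return (path, reported_total, sum_of_children) for every parent
    whose reported total differs from the sum of its direct children
    by more than tol."""
    here = f"{path}/{node.label}"
    mismatches = []
    if node.children:
        child_sum = sum(child.reported for child in node.children)
        if abs(child_sum - node.reported) > tol:
            mismatches.append((here, node.reported, child_sum))
        # Recurse so that errors deeper in the hierarchy are also located.
        for child in node.children:
            mismatches.extend(find_mismatches(child, tol, here))
    return mismatches


# Made-up figures: "B2" is assumed to have lost a digit during extraction
# (10.0 instead of 100.0), so "Head B" no longer matches its children.
budget = Node("Total", 600.0, [
    Node("Head A", 350.0, [Node("A1", 100.0), Node("A2", 250.0)]),
    Node("Head B", 250.0, [Node("B1", 150.0), Node("B2", 10.0)]),
])

for where, reported, computed in find_mismatches(budget):
    print(f"{where}: reported {reported}, children sum to {computed}")
```

Because every level carries its own total, a failed check narrows the error to one parent and its direct children, which is what makes this kind of internal validation useful for long documents.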

Taxonomy

Topics: Handwritten Text Recognition Techniques · Advanced Text Analysis Techniques · Natural Language Processing Techniques