Neurosymbolic Information Extraction from Transactional Documents

Arthur Hemmer; Mickaël Coustaty; Nicola Bartolo; Jean-Marc Ogier

arXiv:2512.09666·cs.CL·December 11, 2025

TL;DR

This paper introduces a neurosymbolic framework for extracting information from transactional documents. It combines language models, which generate candidate extractions, with symbolic validation, which filters them, to improve accuracy and enable more effective zero-shot extraction.

Contribution

It presents a schema-based approach with symbolic validation, relabeled datasets, and high-quality label generation for knowledge distillation in transactional document extraction.

Findings

01 Significant improvements in F1-scores and accuracy

02 Effective zero-shot extraction enabled by symbolic validation

03 Enhanced knowledge distillation through high-quality labels

Abstract

This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
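The three validation levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (`line_items`, `amount`, `subtotal`, `tax`, `total`), the helper names, and the two arithmetic rules are assumptions chosen for illustration, and the candidates are assumed to arrive as JSON strings produced by a language model.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema fields; the paper's actual schema is not reproduced here.
REQUIRED_FIELDS = ("line_items", "subtotal", "tax", "total")


def validate_candidate(raw: str) -> bool:
    """Return True if a candidate passes all three illustrative checks."""
    # Syntactic level: the candidate must parse as a JSON object.
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict):
        return False

    # Task level: required fields exist and every amount is a finite decimal.
    if any(field not in doc for field in REQUIRED_FIELDS):
        return False
    if not isinstance(doc["line_items"], list):
        return False
    try:
        items = [Decimal(str(item["amount"])) for item in doc["line_items"]]
        subtotal = Decimal(str(doc["subtotal"]))
        tax = Decimal(str(doc["tax"]))
        total = Decimal(str(doc["total"]))
    except (InvalidOperation, KeyError, TypeError):
        return False
    if not all(value.is_finite() for value in [*items, subtotal, tax, total]):
        return False

    # Domain level: assumed arithmetic constraints for a transactional document.
    return sum(items, Decimal("0")) == subtotal and subtotal + tax == total


def first_valid(candidates):
    """Return the first candidate that passes validation, or None."""
    for raw in candidates:
        if validate_candidate(raw):
            return raw
    return None
```

In this sketch, a candidate whose amounts do not add up is rejected even though it is well-formed JSON with every field present, which is the kind of filtering the abstract attributes to domain-level validation. Exact `Decimal` arithmetic is used instead of floats so that monetary sums compare reliably.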

Taxonomy

Topics: Topic Modeling · Handwritten Text Recognition Techniques · Natural Language Processing Techniques