Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents
Ramon Pires, Fábio C. de Souza, Guilherme Rosa, Roberto A. Lotufo, Rodrigo Nogueira

TL;DR
This paper explores sequence-to-sequence models as a unified approach for extracting structured information from legal and registration documents, simplifying the pipeline and enabling easier system inspection.
Contribution
It introduces a model that jointly extracts information and generates structured output, replacing traditional token classification and rule-based post-processing, and it proposes a novel method for aligning the output with the input text.
Findings
Sequence-to-sequence models perform comparably to classical pipelines.
The approach reduces pipeline complexity and maintenance effort.
The proposed input-output alignment method makes system inspection and auditing easier.
Abstract
A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our…
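The two ideas in the abstract, generating output that is already structured and aligning that output back to the input text, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `field: value` output format, the function names `parse_structured_output` and `align_to_input`, and the example document are hypothetical, and the alignment step uses a generic longest-common-substring match from the standard library as a stand-in, not the alignment method proposed in the paper (which the truncated abstract does not describe).

```python
import difflib
from typing import Dict, Optional, Tuple


def parse_structured_output(generated: str) -> Dict[str, str]:
    """Parse a model output made of 'field: value' lines.

    The line-based format is a hypothetical choice for this sketch,
    not the output format used in the paper.
    """
    fields: Dict[str, str] = {}
    for line in generated.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def align_to_input(value: str, source: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets in `source` of the longest block shared with `value`.

    Stand-in alignment: case-insensitive longest common substring.
    Returns None when the two strings share no characters.
    """
    matcher = difflib.SequenceMatcher(
        None, source.lower(), value.lower(), autojunk=False
    )
    match = matcher.find_longest_match(0, len(source), 0, len(value))
    if match.size == 0:
        return None
    return match.a, match.a + match.size


# Hypothetical input document and hypothetical model output.
source_text = "Registered owner: JOHN A. SMITH, born on 03/05/1980, resident of Campinas."
model_output = "owner: John A. Smith\nbirth_date: 1980-05-03\ncity: Campinas"

for name, value in parse_structured_output(model_output).items():
    span = align_to_input(value, source_text)
    evidence = source_text[span[0]:span[1]] if span else None
    print(name, "|", value, "|", evidence)
```

In this sketch the normalized date `1980-05-03` only partially overlaps the source string `03/05/1980`, which shows why a post-processed output cannot always be located by exact matching and why a dedicated alignment method is useful for inspection and auditing.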
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Web Application Security Vulnerabilities
