Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting

Vladimir Bernstein; Andrei Afanassenkov

arXiv:2007.07082 · cs.IR · July 17, 2020 · 1 citation

Open Access

TL;DR

This paper introduces an unsupervised machine learning methodology for fully automatic data extraction from computer-generated documents with arbitrary formatting, reducing manual effort and human intervention.

Contribution

It presents a novel unsupervised approach that detects formatting patterns and hierarchical structures to automate data extraction from diverse document formats.

Findings

01. Successfully identifies repeating formatting patterns

02. Automatically configures data extraction tools

03. Reduces need for manual data processing

Abstract

Processing large amounts of data is an essential problem of the big data era. Most data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the data is transferred using arbitrarily formatted computer-generated documents (such as invoices, purchase orders, and financial reports), which require sophisticated processing and human intervention for data interpretation and extraction. The currently available solutions, ranging from manual data entry to low-level scripting and data extraction tools, are costly and require human intervention. This paper describes the principal methodology for unsupervised, fully automatic data extraction from a wide range of computer-generated documents, assuming that their formatting reflects the original structure of the data sources. The presented methodology…
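
The abstract's central assumption, that formatting reflects the structure of the underlying data, can be illustrated with a small sketch. This is not the authors' algorithm, which this summary does not spell out; it is a hypothetical illustration in which each line is reduced to a coarse format signature and signatures that recur are treated as candidate record patterns. The names `line_signature` and `repeating_patterns`, the signature scheme, and the sample document are all assumptions made for the example.

```python
import re
from collections import Counter


def line_signature(line: str) -> str:
    """Reduce a line to a coarse format signature (hypothetical scheme).

    Runs of letters become "A", runs of digits become "9", and
    whitespace is normalized; punctuation is kept as-is.
    """
    sig = re.sub(r"[A-Za-z]+", "A", line)
    sig = re.sub(r"\d+", "9", sig)
    return re.sub(r"\s+", " ", sig).strip()


def repeating_patterns(lines, min_count=2):
    """Return signatures that occur at least min_count times."""
    counts = Counter(line_signature(l) for l in lines if l.strip())
    return {sig: n for sig, n in counts.items() if n >= min_count}


# Invented sample document: a header, three item lines, and a total.
sample = [
    "INVOICE 1001",
    "Widget 2 9.50",
    "Gadget 10 3.25",
    "Gizmo 1 120.00",
    "TOTAL 171.50",
]

patterns = repeating_patterns(sample)
print(patterns)
```

In this sample the three item lines share one signature while the header and total lines each have a unique one, so only the item lines are flagged as a repeating pattern. A real system would need a far richer notion of formatting (column positions, hierarchy, multi-line records) than this single-pass count.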

Taxonomy

Topics: Time Series Analysis and Forecasting · Music and Audio Processing · Algorithms and Data Compression