Unsupervised Data Extraction from Computer-generated Documents with Single Line Formatting
Vladimir Bernstein, Andrei Afanassenkov

TL;DR
This paper introduces an unsupervised machine learning methodology for fully automatic data extraction from computer-generated documents with arbitrary formatting, reducing manual effort and human intervention.
Contribution
It presents a novel unsupervised approach that detects formatting patterns and hierarchical structures to automate data extraction from diverse document formats.
Findings
Successfully identifies repeating formatting patterns
Automatically configures data extraction tools
Reduces need for manual data processing
Abstract
Processing large amounts of data is an essential problem of the big data era. Most data exchange is done via direct communication (using APIs) and well-structured file formats (JSON, XML, EDI, etc.), but a significant portion of the data is transferred using arbitrarily formatted computer-generated documents (such as invoices, purchase orders, financial reports, etc.), which require sophisticated processing and human intervention for data interpretation and extraction. The currently available solutions, ranging from manual data entry to low-level scripting and data extraction tools, are costly and require human intervention. This paper describes the principal methodology for unsupervised, fully automatic data extraction from a wide range of computer-generated documents, assuming that their formatting reflects the original structure of the data sources. The presented methodology…
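To make the idea of "repeating formatting patterns" concrete, here is a minimal sketch in Python. It is not the authors' algorithm, which the abstract does not specify in detail; it only assumes that each line of a single-line-formatted document can be reduced to a coarse signature (letter runs become `A`, digit runs become `9`) and that signatures occurring several times hint at repeated records. The function names `line_signature` and `repeating_patterns`, the signature scheme, and the sample invoice lines are all hypothetical.

```python
import re
from collections import Counter


def line_signature(line: str) -> str:
    """Reduce a line to a coarse formatting signature (illustrative scheme only)."""
    sig = re.sub(r"[A-Za-z]+", "A", line)   # any run of letters -> A
    sig = re.sub(r"\d+", "9", sig)          # any run of digits  -> 9
    return re.sub(r"\s+", " ", sig).strip() # collapse whitespace


def repeating_patterns(lines, min_count=2):
    """Return signatures that occur at least min_count times, with their counts."""
    counts = Counter(line_signature(l) for l in lines if l.strip())
    return {sig: n for sig, n in counts.items() if n >= min_count}


# Hypothetical invoice-like text: one header, three detail lines, one total.
sample = [
    "Invoice 1001",
    "Widget   2  9.50",
    "Gadget   1  24.00",
    "Bolt     10 0.25",
    "Total 43.50",
]
patterns = repeating_patterns(sample)
print(patterns)  # {'A 9 9.9': 3}
```

In this toy input the three detail lines share one signature, while the header and total lines are unique, which is the kind of regularity an unsupervised extractor could use to separate repeated records from surrounding context. Real documents would need a richer signature (column positions, punctuation classes, nesting) than this sketch provides.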
Taxonomy
Topics: Time Series Analysis and Forecasting · Music and Audio Processing · Algorithms and Data Compression
