Landmarks and Regions: A Robust Approach to Data Extraction
Suresh Parthasarathy, Lincy Pattanaik, Anirudh Khatry, Arun Iyer, Arjun Radhakrishna, Sriram Rajamani, Mohammad Raza

TL;DR
This paper introduces a landmark-based data extraction method that improves robustness to format changes in semi-structured documents by focusing on small regions of interest, inspired by human document processing.
Contribution
The paper presents a novel landmark and region-based approach for data extraction, implemented in the LRSyn tool, enhancing robustness over traditional methods.
Findings
Robustness to format changes demonstrated in HTML and scanned documents
Effective extraction of specific fields like passenger info and prices
LRSyn outperforms traditional extraction methods in varied formats
Abstract
We propose a new approach to extracting data items or field values from semi-structured documents. Examples of such problems include extracting passenger name, departure time and departure airport from a travel itinerary, or extracting price of an item from a purchase receipt. Traditional approaches to data extraction use machine learning or program synthesis to process the whole document to extract the desired fields. Such approaches are not robust to format changes in the document, and the extraction process typically fails even if changes are made to parts of the document that are unrelated to the desired fields of interest. We propose a new approach to data extraction based on the concepts of landmarks and regions. Humans routinely use landmarks in manual processing of documents to zoom in and focus their attention on small regions of interest in the document. Inspired by this human…
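The landmark-and-region idea described in the abstract can be illustrated with a small sketch. This is not LRSyn's algorithm: the function name `extract_near_landmark`, the sample itineraries, the landmark string "Departure time:", and the regular expression are all assumptions made up for illustration. The sketch only shows why restricting extraction to a small region next to a landmark tolerates changes elsewhere in the document, while a whole-document pattern does not.

```python
import re
from typing import Optional


def extract_near_landmark(document: str, landmark: str, pattern: str,
                          lines_after: int = 1) -> Optional[str]:
    """Hypothetical helper: find `landmark`, then search for `pattern` only
    in a small region (rest of that line plus `lines_after` following lines)."""
    lines = document.splitlines()
    for i, line in enumerate(lines):
        pos = line.find(landmark)
        if pos == -1:
            continue
        # Region of interest: text after the landmark on its own line,
        # plus a few lines immediately below it.
        region_lines = [line[pos + len(landmark):]] + lines[i + 1:i + 1 + lines_after]
        match = re.search(pattern, "\n".join(region_lines))
        if match:
            return match.group(0)
    return None


# Illustrative (made-up) itinerary in its original format.
itinerary_v1 = """Booking confirmation
Passenger: Jane Doe
Departure time: 08:45
"""

# Same itinerary after a format change: a banner and extra fields were added,
# and the value moved to the line below its label.
itinerary_v2 = """*** SPECIAL OFFER: 20% off at 10:00 today ***
Booking confirmation
Ref 12:30-A
Passenger: Jane Doe
Seat 14C
Departure time:
  08:45
"""

TIME = r"\b\d{2}:\d{2}\b"  # assumed shape of the field value

# Whole-document extraction: take the first time-like string anywhere.
whole_v1 = re.search(TIME, itinerary_v1).group(0)
whole_v2 = re.search(TIME, itinerary_v2).group(0)

# Landmark-based extraction: only look next to "Departure time:".
local_v1 = extract_near_landmark(itinerary_v1, "Departure time:", TIME)
local_v2 = extract_near_landmark(itinerary_v2, "Departure time:", TIME)

print(whole_v1, whole_v2)  # 08:45 10:00  (the banner breaks the whole-document pattern)
print(local_v1, local_v2)  # 08:45 08:45
```

In this toy example the whole-document pattern picks up the time in the unrelated banner once the format changes, whereas the landmark-anchored search still returns the departure time. How LRSyn actually chooses landmarks and regions is described in the paper itself, not here.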
Taxonomy
Topics: Web Data Mining and Analysis · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Emirates Airlines Office in Dubai
