FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents
Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata

TL;DR
FreeDOM is a neural architecture that extracts structured information from web pages. Trained on a small number of seed sites, it generalizes to unseen sites and outperforms previous methods without relying on features from visual rendering.
Contribution
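The paper introduces FreeDOM, a two-stage neural approach: the first stage learns a representation for each DOM node from its text and markup, and the second captures longer-range semantic relations between nodes. Together, the stages let the model transfer to unseen websites after training on a small number of seed sites.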
Findings
Outperforms the previous state-of-the-art approach by nearly 3.7 F1 points on average
Does not require features over rendered pages or handcrafted heuristics
Generalizes well to unseen sites after training on few seed sites
Abstract
Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites…
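To make the two stages in the abstract more concrete, the following is a minimal sketch of the kind of raw inputs each stage might consume. It is not the authors' implementation and contains no neural network. It only collects, for each text node, its text and markup context (stage one) and, for each node pair, simple distance signals (stage two). The helper names (`TextNodeCollector`, `node_features`, `path_distance`, `pair_features`), the example HTML, and the choice of features are all illustrative assumptions.

```python
from html.parser import HTMLParser
from itertools import combinations

# Tags that never have a closing tag; they are kept off the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextNodeCollector(HTMLParser):
    """Collect every non-empty text node with its tag path (a simplified XPath)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # one dict per text node, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append({"text": text, "path": "/" + "/".join(self.stack)})


def node_features(html):
    """Stage-one style inputs (assumed): node text, tag path, preceding text."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    for i, node in enumerate(nodes):
        node["prev_text"] = nodes[i - 1]["text"] if i > 0 else ""
    return nodes


def path_distance(path_a, path_b):
    """Number of tree edges between two tag paths via their shared prefix."""
    a, b = path_a.strip("/").split("/"), path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def pair_features(nodes):
    """Stage-two style inputs (assumed): (i, j, path distance, position gap)."""
    return [(i, j, path_distance(nodes[i]["path"], nodes[j]["path"]), j - i)
            for i, j in combinations(range(len(nodes)), 2)]


# Hypothetical page fragment used only for illustration.
EXAMPLE_HTML = ("<html><body><h1>Honda Civic</h1>"
                "<div><span>Price:</span><b>$21,000</b></div></body></html>")

if __name__ == "__main__":
    nodes = node_features(EXAMPLE_HTML)
    for node in nodes:
        print(node)
    for pair in pair_features(nodes):
        print(pair)
```

In this sketch, the node `$21,000` carries the preceding text `Price:` and sits two tree edges away from it. A learned model could use signals of this sort to label the value, in place of per-site examples or rendering-based heuristics. Tag paths here omit sibling indices, so distinct nodes can share a path.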
