FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Bill Yuchen Lin; Ying Sheng; Nguyen Vo; Sandeep Tata

arXiv:2010.10755·cs.CL·October 22, 2020

TL;DR

FreeDOM is a neural architecture that effectively extracts structured information from web pages, generalizing across sites with minimal supervision and outperforming previous methods without relying on visual rendering features.

Contribution

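The paper introduces FreeDOM, a two-stage neural approach. The first stage learns a representation for each DOM node from its text and markup, and the second stage uses a relational neural network to capture longer-range and semantic relations between nodes. Together, the two stages let the model transfer to unseen websites after training on a small number of seed sites.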

Findings

1. Outperforms previous state-of-the-art by 3.7 F1 points on average
2. Does not require features over rendered pages or handcrafted heuristics
3. Generalizes well to unseen sites after training on few seed sites

Abstract

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites…
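
To make the two kinds of input in the abstract concrete, here is a minimal Python sketch of what "text plus markup" per DOM node and pairwise relations between nodes could look like. It is not the authors' implementation and contains no neural network: the class `TextNodeCollector`, the helpers `collect_nodes` and `pair_features`, the feature names, and the sample page are all hypothetical and chosen only for illustration. It uses the standard-library `html.parser` module.

```python
from html.parser import HTMLParser
from itertools import combinations

# Tags that never have a closing tag, so they are not pushed on the path.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect each non-empty text node with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.path = []   # stack of currently open tags
        self.nodes = []  # one dict per text node, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy HTML.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append({
                "index": len(self.nodes),
                "text": text,                 # text signal for the node
                "tag_path": tuple(self.path)  # markup signal for the node
            })


def collect_nodes(html):
    """Per-node inputs: roughly what a first-stage encoder would consume."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pair_features(a, b):
    """Pairwise inputs: roughly what a relational second stage would consume.

    Paths are compared by tag name only, so the tree distance is an
    approximation (two sibling <div>s look like the same branch).
    """
    shared = shared_prefix_len(a["tag_path"], b["tag_path"])
    return {
        "pair": (a["index"], b["index"]),
        "shared_path_depth": shared,
        "tree_distance": len(a["tag_path"]) + len(b["tag_path"]) - 2 * shared,
        "position_gap": b["index"] - a["index"],
    }


# Hypothetical sample page, not taken from the paper's dataset.
PAGE = """
<html><body>
  <h1>Example Sedan</h1>
  <div><span>Price:</span><b>$20,000</b></div>
</body></html>
"""

nodes = collect_nodes(PAGE)
pairs = [pair_features(a, b) for a, b in combinations(nodes, 2)]
```

In this toy page, the label "Price:" and the value "$20,000" share a deeper tag path with each other than either shares with the heading. That is the kind of relational signal a second stage could exploit once a learned model replaces these hand-written features.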
