WebFormer: The Web-page Transformer for Structure Information Extraction
Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu

TL;DR
WebFormer is a novel transformer-based model that effectively captures web page layout and structure to improve extraction of structured information from web documents, outperforming existing methods.
Contribution
The paper introduces WebFormer, a transformer model that models web layout explicitly using HTML tokens and attention patterns, enhancing structure information extraction.
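To make the idea of "HTML tokens" concrete, here is a minimal sketch of how a page could be split into one token per DOM node plus the text tokens each node contains. It is an illustration under stated assumptions, not the paper's implementation: the names `DomTokenizer`, `tokenize_page`, and `text_to_html_mask` are hypothetical, whitespace splitting stands in for a real subword tokenizer, and the mask shows only one simple attention pattern (each text token attending to the DOM node that contains it).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they get a token but no stack entry.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomTokenizer(HTMLParser):
    """Hypothetical helper: collects one token per DOM node and per word."""

    def __init__(self):
        super().__init__()
        self.html_tokens = []  # (tag, parent_index); parent_index is -1 at the root
        self.text_tokens = []  # (word, index of the enclosing DOM node)
        self._stack = []       # indices of the currently open DOM nodes

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else -1
        self.html_tokens.append((tag, parent))
        if tag not in VOID_TAGS:
            self._stack.append(len(self.html_tokens) - 1)

    def handle_endtag(self, tag):
        # Close the nearest open node with this tag (tolerates sloppy HTML).
        for i in range(len(self._stack) - 1, -1, -1):
            if self.html_tokens[self._stack[i]][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        node = self._stack[-1] if self._stack else -1
        for word in data.split():  # assumption: whitespace tokenization
            self.text_tokens.append((word, node))


def tokenize_page(html):
    """Return (html_tokens, text_tokens) for an HTML string."""
    parser = DomTokenizer()
    parser.feed(html)
    parser.close()
    return parser.html_tokens, parser.text_tokens


def text_to_html_mask(html_tokens, text_tokens):
    """mask[i][j] is True when text token i sits directly inside DOM node j."""
    return [[node == j for j in range(len(html_tokens))]
            for _, node in text_tokens]


if __name__ == "__main__":
    page = "<div><h1>Blue Mug</h1><span>$12</span></div>"
    html_tokens, text_tokens = tokenize_page(page)
    print(html_tokens)
    print(text_tokens)
    print(text_to_html_mask(html_tokens, text_tokens))
```

A boolean mask of this kind could be passed to an attention layer so that text tokens see their enclosing node; a fuller model would add further patterns (for example between neighboring DOM nodes), which are left out of this sketch.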
Findings
WebFormer achieves superior performance on SWDE and Common Crawl benchmarks.
The model effectively leverages web layout information for better token attention.
In experiments, WebFormer outperforms several state-of-the-art methods.
Abstract
Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand and price. It is an important research topic which has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to a variety of web layout patterns. Limited work has focused on modeling the web layout for extracting the text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their…
Taxonomy
Topics: Web Data Mining and Analysis · SAS software applications and methods
