HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM
Kazuki Kawamura, Akihiro Yamamoto

TL;DR
This paper introduces a novel tree-structured LSTM-based method for extracting and integrating information from HTML tables across web pages, improving data retrieval by leveraging structural and linguistic features.
Contribution
The paper extends tree-structured LSTM to effectively extract and combine information from HTML tables with varying structures, enabling better web data integration.
Findings
Effective extraction of HTML table information demonstrated
Improved data retrieval from web pages shown
Method outperforms baseline approaches
Abstract
In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends the tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
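To make "similar contents but different structure" concrete, here is a minimal sketch of turning HTML tables into trees and visiting them bottom-up, the order in which a tree-structured LSTM would compute hidden states. It is not the authors' implementation and contains no LSTM: `Node`, `TableTreeBuilder`, `build_tree`, and `post_order` are hypothetical names, the two sample tables are made up for illustration, and the parser assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One HTML element: tag name, direct text, and child elements."""

    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TableTreeBuilder(HTMLParser):
    """Builds a Node tree from HTML in which every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the closing tag matches the stack top.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def build_tree(html):
    """Parse an HTML string and return the synthetic root Node."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def post_order(node):
    """Yield nodes children-first (leaves before their parents)."""
    for child in node.children:
        yield from post_order(child)
    yield node


# Two made-up tables with the same content but different structure.
row_layout = "<table><tr><th>Course</th><td>Logic</td></tr></table>"
column_layout = "<table><tr><th>Course</th></tr><tr><td>Logic</td></tr></table>"

for html in (row_layout, column_layout):
    tree = build_tree(html)
    print([(n.tag, n.text) for n in post_order(tree)])
```

The two inputs carry the same header and value but produce differently shaped trees, which is the variation a model over HTML structure has to absorb when mapping cells from many pages into one integrated table.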
Taxonomy
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
