HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM

Kazuki Kawamura; Akihiro Yamamoto

arXiv:2409.19445 · cs.IR · October 1, 2024

TL;DR

This paper introduces a novel tree-structured LSTM-based method for extracting and integrating information from HTML tables across web pages, improving data retrieval by leveraging structural and linguistic features.

Contribution

The paper extends tree-structured LSTM to effectively extract and combine information from HTML tables with varying structures, enabling better web data integration.
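To make "varying structures" concrete, the sketch below parses two small tables that carry the same content but lay it out differently (one with headers in the first column, one with headers in the first row). It uses only Python's standard `html.parser`. The tables, the `Node` class, and the helper names `parse`, `leaf_texts`, and `shape` are hypothetical illustrations, not taken from the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input"}  # elements that never get an end tag


class Node:
    """One node of the parsed HTML tree: an element or a '#text' leaf."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TableTreeBuilder(HTMLParser):
    """Builds a Node tree from HTML markup using a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_texts(node):
    """Collect the text of every '#text' leaf in document order."""
    if node.tag == "#text":
        return [node.text]
    return [t for child in node.children for t in leaf_texts(child)]


def shape(node):
    """Nested tuple of tag names: the structure with the text removed."""
    return (node.tag, tuple(shape(child) for child in node.children))


# Hypothetical example: same facts, headers down the side vs. across the top.
table_a = ("<table><tr><th>Course</th><td>Algebra</td></tr>"
           "<tr><th>Teacher</th><td>Sato</td></tr></table>")
table_b = ("<table><tr><th>Course</th><th>Teacher</th></tr>"
           "<tr><td>Algebra</td><td>Sato</td></tr></table>")

tree_a, tree_b = parse(table_a), parse(table_b)
print(sorted(leaf_texts(tree_a)) == sorted(leaf_texts(tree_b)))
print(shape(tree_a) == shape(tree_b))
```

The two trees hold the same set of cell texts while their tag structures differ, which is the situation a purely positional extractor (for example, "take the second cell of the first row") cannot handle uniformly.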

Findings

01 Effective extraction of HTML table information demonstrated

02 Improved data retrieval from web pages shown

03 Method outperforms baseline approaches

Abstract

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends tree-structured LSTM, a neural network for tree-structured data, to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
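For readers unfamiliar with the base model, the following is a minimal sketch of a standard Child-Sum Tree-LSTM forward pass over an HTML-like tree, written in pure Python. It is not the paper's HTML-LSTM: the abstract does not specify how the authors extend the model, so nothing here reflects that extension. The one-hot tag encoding, the hidden size, the random weights, and all class and function names are assumptions made for illustration; a real system would likely feed word embeddings for text leaves and train the weights.

```python
import math
import random

TAGS = ["table", "tr", "th", "td", "#text"]  # assumed toy vocabulary


class Node:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def add(*vectors):
    """Element-wise sum of one or more equal-length vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def one_hot(tag):
    return [1.0 if t == tag else 0.0 for t in TAGS]


class ChildSumTreeLSTM:
    def __init__(self, in_dim, dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.dim = dim
        # One (W, U, b) triple per gate: input, forget, output, update.
        self.params = {g: (mat(dim, in_dim), mat(dim, dim), [0.0] * dim)
                       for g in "ifou"}

    def _gate(self, name, x, h, activation):
        W, U, b = self.params[name]
        return [activation(z) for z in add(matvec(W, x), matvec(U, h), b)]

    def forward(self, node):
        """Return (h, c) for `node`, computed bottom-up from its children."""
        x = one_hot(node.tag)
        states = [self.forward(child) for child in node.children]
        # Sum of child hidden states; all zeros for a leaf.
        h_sum = add([0.0] * self.dim, *[h for h, _ in states])
        i = self._gate("i", x, h_sum, sigmoid)
        o = self._gate("o", x, h_sum, sigmoid)
        u = self._gate("u", x, h_sum, math.tanh)
        c = [a * b for a, b in zip(i, u)]
        for h_k, c_k in states:
            # A separate forget gate per child, driven by that child's h.
            f_k = self._gate("f", x, h_k, sigmoid)
            c = [cj + fj * ck for cj, fj, ck in zip(c, f_k, c_k)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c


model = ChildSumTreeLSTM(in_dim=len(TAGS), dim=4, seed=0)
tree = Node("table", [
    Node("tr", [Node("th", [Node("#text")]), Node("td", [Node("#text")])]),
])
h, c = model.forward(tree)
print(len(h))
```

The hidden state `h` at the root summarizes the whole subtree. Because each node's state depends on its own input and on its children's states, the representation reflects both what a cell contains and where it sits in the markup.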

Taxonomy

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory