Multi-Record Web Page Information Extraction From News Websites

Alexander Kustenkov; Maksim Varlamov; Alexander Yatskov

arXiv:2502.14625 · cs.CL · February 21, 2025

Open Access

TL;DR

This paper introduces a large-scale Russian-language dataset for extracting information from multi-record web pages, along with novel multi-stage extraction methods that outperform existing approaches.

Contribution

The paper presents the first Russian-language dataset for multi-record list pages and proposes new multi-stage extraction techniques that use MarkupLM.

Findings

01. The dataset contains 13,120 web pages with news lists and attributes of various types, including optional and multi-valued ones.

02. The proposed methods demonstrate improved extraction accuracy.

03. Experiments validate the effectiveness of multi-stage strategies.

Abstract

In this paper, we focus on the problem of extracting information from web pages containing many records, a task of growing importance in the era of massive web data. Recent advances in neural network methods have improved the quality of information extraction from web pages. Nevertheless, most research and datasets target detail pages, leaving multi-record "list pages" relatively understudied despite their widespread presence and practical significance. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. This is the first dataset for this task in the Russian language. It contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity, and includes attributes of various types, including optional and multi-valued, providing a…
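
To make the distinction between required, optional, and multi-valued attributes concrete, the following is a minimal sketch of how one record on a list page might be represented. The class names (`NewsRecord`, `ListPage`), the field names (`title`, `date`, `tags`), and the example URL are hypothetical and chosen only for illustration; they are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NewsRecord:
    """One record on a list page (hypothetical fields, not the dataset's schema)."""
    title: str                                      # required, single-valued
    date: Optional[str] = None                      # optional: may be missing for a record
    tags: List[str] = field(default_factory=list)   # multi-valued: zero or more values


@dataclass
class ListPage:
    """A multi-record page: one URL, many records."""
    url: str
    records: List[NewsRecord] = field(default_factory=list)


# Illustrative page with two records; the second lacks the optional
# and multi-valued attributes entirely.
page = ListPage(
    url="https://example.com/news",
    records=[
        NewsRecord(title="First headline", date="2025-02-21",
                   tags=["politics", "economy"]),
        NewsRecord(title="Second headline"),
    ],
)
```

Under this representation, an extractor for list pages has to both group page elements into records and decide, per record, which attributes are present and how many values each one has, which is what distinguishes the task from single-record detail-page extraction.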

Taxonomy

Topics: Web Data Mining and Analysis