Multilingual Attribute Extraction from News Web Pages

Pavel Bedrin; Maksim Varlamov; Alexander Yatskov

arXiv:2502.02167·cs.CL·February 5, 2025

Open Access

TL;DR

This paper develops and evaluates multilingual neural models for extracting news article attributes across six languages, improving upon existing tools and addressing language diversity in web page information extraction.

Contribution

The paper introduces a publicly available multilingual dataset of 3,172 annotated news web pages from 161 websites and fine-tunes state-of-the-art models on it, demonstrating improved extraction performance over existing open-source tools for news web pages in multiple languages.

Findings

01. Fine-tuned models outperform existing open-source tools.

02. Translation into English affects extraction quality.

03. Pre-trained multilingual models enhance attribute extraction.

Abstract

This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. Recent neural network models have shown high efficacy in extracting information from semi-structured web pages. However, these models are predominantly applied to domains like e-commerce and are pre-trained using English data, complicating their application to web pages in other languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic) from 161 websites. The dataset is publicly available on GitHub. We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality. Additionally, we pre-trained another state-of-the-art model, DOM-LM, on…
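
Markup-aware models such as MarkupLM operate on the text nodes of a page together with each node's XPath, so attribute extraction can be framed as labelling those nodes (for example as title, date, or body text). The sketch below only illustrates that input representation using the Python standard library. It is not the authors' pipeline: the names `TextNodeCollector`, `extract_text_nodes`, and `SAMPLE_PAGE` are hypothetical, and the end-tag handling is a simplification that assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not visible article text.
SKIPPED_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collects (xpath, text) pairs for every non-empty text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []           # (tag, sibling index) for each open element
        self.child_counts = [{}]  # per-depth counters of child tags seen so far
        self.nodes = []           # collected (xpath, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, counts[tag]))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Simplification: only pop when the tag closes the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1][0] in SKIPPED_TAGS:
            return
        xpath = "".join(f"/{tag}[{index}]" for tag, index in self.stack)
        self.nodes.append((xpath, text))


def extract_text_nodes(html):
    """Return a list of (xpath, text) pairs for the given HTML string."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


# Made-up page used purely for illustration.
SAMPLE_PAGE = (
    "<html><body><h1>Headline</h1>"
    "<div><span>2025-02-05</span><br>"
    "<p>First paragraph.</p><p>Second paragraph.</p></div>"
    "<script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    for xpath, text in extract_text_nodes(SAMPLE_PAGE):
        print(xpath, "|", text)
```

Each resulting pair is the kind of unit a node-classification model would label. A real pipeline would additionally tokenize the text and map the XPath to the model's expected input format.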

Taxonomy

Topics: Web Data Mining and Analysis · Web visibility and informetrics