Boilerplate Detection via Semantic Classification of TextBlocks

Hao Zhang; Jie Wang

arXiv:2203.04467·cs.CL·March 10, 2022

TL;DR

This paper introduces SemText, a hierarchical neural network that uses semantic representations to accurately detect boilerplate HTML across various webpage types, including out-of-domain data.

Contribution

The paper presents SemText, a novel semantic classification model for boilerplate detection that achieves state-of-the-art accuracy and robustness across multiple datasets.

Findings

01. SemText achieves state-of-the-art accuracy on news webpage datasets.

02. SemText effectively detects boilerplate on out-of-domain community Q&A webpages.

03. The model demonstrates robustness across different webpage types.

Abstract

We present a hierarchical neural network model called SemText to detect HTML boilerplate based on a novel semantic representation of HTML tags, class names, and text blocks. We train SemText on three published datasets of news webpages and fine-tune it using a small amount of development data from CleanEval and GoogleTrends-2017. We show that SemText achieves state-of-the-art accuracy on these datasets. We then demonstrate the robustness of SemText by showing that it also detects boilerplate effectively on out-of-domain community-based question-answer webpages.
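
The abstract names three inputs: HTML tags, class names, and text blocks. As a rough illustration of what such inputs could look like before any model sees them, the sketch below walks an HTML string with Python's standard `html.parser` and records, for each run of visible text, the enclosing tag path and the class names found on that path. This is not the SemText pipeline. The names `TextBlockExtractor` and `extract_text_blocks`, the dictionary layout, and the choice to treat every text node as one block are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible page content.
SKIP_TAGS = {"script", "style"}


class TextBlockExtractor(HTMLParser):
    """Hypothetical helper: collects tag path, class names, and text per block."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, [class names])
        self.blocks = []  # one dict per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # A missing or valueless class attribute yields an empty list.
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or any(t in SKIP_TAGS for t, _ in self.stack):
            return
        self.blocks.append({
            "tags": [t for t, _ in self.stack],
            "classes": [c for _, cs in self.stack for c in cs],
            "text": text,
        })


def extract_text_blocks(html):
    """Return the list of text blocks found in an HTML string."""
    parser = TextBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_text_blocks` on a page yields one record per text node. A classifier of the kind the abstract describes would then label each record as boilerplate or main content; how SemText encodes these fields is not stated here and is not modeled by this sketch.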

Taxonomy

Topics: Web Data Mining and Analysis · Spam and Phishing Detection · Advanced Malware Detection Techniques