Semi-Supervised Method using Gaussian Random Fields for Boilerplate Removal in Web Browsers
Joy Bose, Sumanta Mukherjee

TL;DR
This paper introduces a semi-supervised approach using Gaussian Random Fields to effectively remove boilerplate content from web pages, aiding various browser features like ad blocking and content summarization.
Contribution
It presents a novel semi-supervised method that models webpages as graphs and propagates labels based on similarity, reducing manual labeling effort for boilerplate removal.
Findings
Preliminary results show promising accuracy in boilerplate detection.
Graph-based semi-supervised learning reduces manual labeling effort.
Method can be integrated into web browser tools for improved content extraction.
Abstract
Boilerplate removal is the problem of stripping noisy content, such as ads, from a webpage and extracting the relevant content for use by other services. It supports several web browser features, including ad blocking, accessibility tools such as read-aloud, translation, and summarization. Building a training dataset for a boilerplate detection and removal model requires labeling webpage data, and doing this manually is tedious and time-consuming. A semi-supervised model, in which some webpage elements are labeled manually and the labels of the others are inferred, is therefore useful. In this paper we present a solution for extracting relevant content from a webpage that relies on semi-supervised learning using Gaussian Random Fields. We first represent the webpage as a graph, with text elements as nodes and the edge…
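The label propagation idea in the abstract can be illustrated with the standard harmonic-function solution for Gaussian Random Fields, where the scores of unlabeled nodes are obtained by solving a linear system built from the graph Laplacian. The sketch below is not the authors' implementation: the abstract is cut off before it defines the edge weights, so the RBF similarity, the two-dimensional feature vectors, and the function names `rbf_weights` and `harmonic_labels` are all assumptions made for illustration.

```python
import numpy as np


def rbf_weights(features, sigma=1.0):
    """Assumed similarity: w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), no self-loops."""
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)
    return weights


def harmonic_labels(weights, labeled_idx, labeled_values):
    """Harmonic solution: solve (D_uu - W_uu) f_u = W_ul f_l for the unlabeled nodes."""
    n = weights.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    labeled_values = np.asarray(labeled_values, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    # Graph Laplacian L = D - W, where D holds the weighted node degrees.
    laplacian = np.diag(weights.sum(axis=1)) - weights
    l_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    w_ul = weights[np.ix_(unlabeled_idx, labeled_idx)]

    f_u = np.linalg.solve(l_uu, w_ul @ labeled_values)

    scores = np.empty(n)
    scores[labeled_idx] = labeled_values  # manual labels are kept fixed
    scores[unlabeled_idx] = f_u           # inferred scores lie between the labels
    return scores


# Hypothetical features for four text nodes (values invented for illustration).
features = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
weights = rbf_weights(features)

# Node 0 is hand-labeled as content (1.0), node 2 as boilerplate (0.0).
scores = harmonic_labels(weights, labeled_idx=[0, 2], labeled_values=[1.0, 0.0])
is_content = scores > 0.5
```

Thresholding the scores at 0.5 is one simple way to turn them into content/boilerplate decisions. A dense weight matrix is used here for brevity; a real page with many text nodes would more plausibly use a sparse neighborhood graph.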
