A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui

TL;DR
This paper introduces FastContentExtractor, a template-based algorithm that quickly identifies and extracts the main textual content of web pages, so that search engines can ignore non-informative blocks such as advertisements and navigation links.
Contribution
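It presents FastContentExtractor, an improvement on the ContentExtractor algorithm that automatically identifies and stores templates describing the structure of a website's content blocks, then uses those templates to extract the primary content blocks of new pages from the same site quickly, in their original order.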
Findings
Significantly faster content extraction than the ContentExtractor algorithm it builds on.
High accuracy in identifying main content blocks across diverse websites.
Maintains the hierarchical order of the extracted blocks, so they appear in the same order as in the original page.
Abstract
Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also look at non-informative blocks of Web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor, which improves on the ContentExtractor algorithm, to automatically detect the main content blocks in a Web page. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new Web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones.
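The template idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a simplified page representation (an ordered list of `(path, text)` blocks) and a simplified rule for telling content from boilerplate (a block path counts as content if its text differs between sample pages of the same site). The names `learn_template` and `extract_content`, and the example paths, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed representation: a page is an ordered list of (structural path, text) blocks.
Block = Tuple[str, str]


def learn_template(pages: List[List[Block]]) -> Set[str]:
    """Return the set of block paths treated as content for one website.

    Simplifying assumption: a path whose text is identical on every sample
    page (navigation bar, advertisement) is non-informative, and a path whose
    text varies between pages is content.
    """
    texts_by_path: Dict[str, Set[str]] = defaultdict(set)
    for page in pages:
        for path, text in page:
            texts_by_path[path].add(text)
    return {path for path, texts in texts_by_path.items() if len(texts) > 1}


def extract_content(page: List[Block], template: Set[str]) -> List[str]:
    """Keep only blocks whose path is in the stored template.

    Iterating over the page in document order preserves the original
    order of the extracted blocks.
    """
    return [text for path, text in page if path in template]


# Two sample pages from the same (hypothetical) site.
sample_pages = [
    [("body/div.nav", "Home | About"), ("body/div.main/h1", "Title A"),
     ("body/div.main/p", "Body A"), ("body/div.ad", "Buy now")],
    [("body/div.nav", "Home | About"), ("body/div.main/h1", "Title B"),
     ("body/div.main/p", "Body B"), ("body/div.ad", "Buy now")],
]
template = learn_template(sample_pages)

# A new page from the same site is processed with a single pass and set lookups.
new_page = [("body/div.nav", "Home | About"), ("body/div.main/h1", "Title C"),
            ("body/div.main/p", "Body C"), ("body/div.ad", "Buy now")]
extracted = extract_content(new_page, template)
print(extracted)
```

The template is computed once per site and reused, so a new page needs only a linear scan instead of a fresh comparison against other pages. In this sketch that reuse is the source of the speed-up.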
