Web Template Extraction Based on Hyperlink Analysis
Julián Alarte (Universitat Politècnica de València), David Insa (Universitat Politècnica de València), Josep Silva (Universitat Politècnica de València), Salvador Tamarit (Universidad Politécnica de Madrid)

TL;DR
This paper introduces a new method for automatically extracting web templates by analyzing DOM tree similarities, which helps improve web indexing efficiency by filtering out irrelevant content like ads and banners.
Contribution
The paper presents a novel template extraction technique based on hyperlink analysis and DOM tree similarity, enhancing web crawling and indexing processes.
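As a rough illustration of how hyperlink analysis and DOM comparison can be combined, the sketch below follows the links found on a starting page and keeps only the nodes that appear, at the same tag path and with the same text, on every linked page. It is a deliberately simplified stand-in and not the paper's algorithm: the `site` dictionary, the helper names (`NodeCollector`, `parse`, `extract_template`), and the exact-match intersection used as the similarity measure are all assumptions made for this example.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not be pushed
# onto the stack of open elements.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class NodeCollector(HTMLParser):
    """Collects (tag path, text) signatures and hyperlink targets of one page."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root first
        self.signatures = set()  # {(tag path, normalized text)}
        self.links = []          # href values in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack:
            self.signatures.add(("/".join(self.stack), text))


def parse(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector


def extract_template(site, start):
    """Hypothetical helper: intersect the start page with the pages it links to."""
    first = parse(site[start])
    linked = [url for url in first.links if url in site and url != start]
    template = set(first.signatures)
    for url in linked:
        template &= parse(site[url]).signatures
    return template


# Toy website (assumed data): three pages sharing the same menu.
MENU = '<ul><li><a href="/a">About</a></li><li><a href="/b">News</a></li></ul>'
site = {
    "/index": f"<html><body>{MENU}<p>Welcome</p></body></html>",
    "/a": f"<html><body>{MENU}<p>About us</p></body></html>",
    "/b": f"<html><body>{MENU}<p>Latest news</p><br></body></html>",
}

if __name__ == "__main__":
    for path, text in sorted(extract_template(site, "/index")):
        print(path, "->", text)
```

In this toy site only the two menu entries survive the intersection, while the page-specific paragraphs are discarded. A real extractor would need a more tolerant tree comparison than exact signature matching, since template nodes often differ slightly from page to page.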
Findings
Effective template identification reduces irrelevant data in web indexing.
The method demonstrates high accuracy in extracting common webpage templates.
Experimental results validate the usefulness of the proposed approach.
Abstract
Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel across all webpages. However, from the point of view of crawlers and indexers, templates are a significant problem because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the…
