Automatic Detection of Webpages that Share the Same Web Template
Julián Alarte (Universitat Politècnica de València, Valencia, Spain), David Insa (Universitat Politècnica de València, Valencia, Spain), Josep Silva (Universitat Politècnica de València, Valencia, Spain), Salvador Tamarit (Universidad Politécnica de Madrid

TL;DR
This paper presents a hyperlink analysis technique that automatically identifies a very small set of webpages sharing the same template within a website, so that the template can be extracted without loading and analyzing many pages.
Contribution
It introduces a new method for discovering a small, high-confidence set of webpages with the same template using hyperlink analysis, reducing the need for extensive webpage analysis.
Findings
Effective identification of webpages sharing templates
Reduces the number of webpages needed for template extraction
High confidence in template grouping
Abstract
Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpage development, content extraction, block detection, and webpage indexing. One of the main goals of template extraction is to identify a set of webpages that share the same template without having to load and analyze too many webpages beforehand. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the same template. The set is computed with a hyperlink analysis that yields a very small set with a high level of confidence.
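As a rough illustration of what a hyperlink analysis of this kind might start from, the sketch below collects the same-site links of a key page and ranks them by how many of the other candidates they link back to. The ranking heuristic (pages in the same menu tend to link to each other), the function names `same_site_links` and `candidate_pages`, and the in-memory `pages` dictionary are assumptions made for this example; the abstract does not specify the paper's actual selection criterion.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, base_url):
    """Return unique absolute links on the same host as base_url."""
    parser = _LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        url = urldefrag(urljoin(base_url, href)).url
        if urlparse(url).netloc == host and url != base_url and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def candidate_pages(pages, start_url, k=3):
    """Hypothetical heuristic (not the paper's): keep the k linked pages
    that link back to the most pages in the candidate set."""
    candidates = [u for u in same_site_links(pages[start_url], start_url)
                  if u in pages]
    pool = set(candidates) | {start_url}

    def score(url):
        return len(set(same_site_links(pages[url], url)) & pool)

    # sorted() is stable, so ties keep their order of appearance.
    return sorted(candidates, key=score, reverse=True)[:k]
```

Here `pages` maps each URL to HTML that has already been downloaded, so the sketch needs no network access. A real implementation would fetch pages on demand, which is exactly the cost the technique tries to keep low.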
