Page Segmentation using Visual Adjacency Analysis
Mohammad Bajammal, Ali Mesbah

TL;DR
This paper introduces a web page segmentation method that combines DOM attributes with visual analysis. In an evaluation on 35 real-world pages, it reports average gains of 156% in precision and 249% in F-measure over state-of-the-art techniques.
Contribution
A novel unsupervised page segmentation approach based on visual adjacency analysis that integrates DOM and visual features for enhanced accuracy.
Findings
Average 156% increase in precision over state-of-the-art methods
Average 249% improvement in F-measure over state-of-the-art methods
Evaluated on 35 real-world web pages for both effectiveness and efficiency
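To make the relative figures above easier to read, the following is a small sketch of how a percentage increase and an F-measure are conventionally computed. It assumes the balanced F1 form of the F-measure, which the summary does not confirm, and the function names and the baseline value of 0.25 are purely illustrative, not numbers from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the paper's "F-measure" is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_increase_pct(baseline: float, new: float) -> float:
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


def apply_increase(baseline: float, pct: float) -> float:
    """Value obtained by raising `baseline` by `pct` percent."""
    return baseline * (1 + pct / 100.0)


# Hypothetical baseline precision, chosen only to illustrate the arithmetic.
hypothetical_baseline = 0.25
improved = apply_increase(hypothetical_baseline, 156)
print(improved / hypothetical_baseline)  # ratio of new to old score
```

Read this way, a 156% increase means the new score is about 2.56 times the baseline, not 1.56 times.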
Abstract
Page segmentation is a web page analysis process that divides a page into cohesive segments, such as sidebars, headers, and footers. Current page segmentation approaches use the DOM, textual content, or rendering style information of the page. However, these approaches have a number of drawbacks, such as a large number of parameters and rigid assumptions about the page, which negatively impact their segmentation accuracy. We propose a novel page segmentation approach based on visual analysis of localized adjacency regions. It combines DOM attributes and visual analysis to build features of a given page and guide an unsupervised clustering. We evaluate our approach on 35 real-world web pages, and examine the effectiveness and efficiency of segmentation. The results show that, compared with the state of the art, our approach achieves an average of 156% increase in precision and 249%…
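As a rough illustration of what grouping elements by visual adjacency can look like, here is a minimal sketch that clusters rendered bounding boxes whose gap is below a threshold. It is not the authors' algorithm: the `Box` type, the `gap` and `segment` functions, the union-find grouping, and the `max_gap` threshold are all assumptions introduced for this example, and the paper's actual features and clustering procedure are not described in this summary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered bounding box of a DOM element, in pixels."""
    name: str
    x: float
    y: float
    w: float
    h: float


def gap(a: Box, b: Box) -> float:
    """Euclidean gap between two boxes; 0 if they touch or overlap."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0.0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0.0)
    return math.hypot(dx, dy)


def segment(boxes: list[Box], max_gap: float) -> list[list[str]]:
    """Group boxes into segments by linking any pair closer than max_gap.

    Uses union-find, so adjacency is transitive: A near B and B near C
    puts A, B, and C in the same segment.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box.name)
    return sorted(sorted(names) for names in groups.values())


# Illustrative layout: a logo and nav bar side by side, a footer far below.
page = [
    Box("logo", 0, 0, 100, 40),
    Box("nav", 110, 0, 300, 40),
    Box("footer", 0, 900, 400, 50),
]
print(segment(page, max_gap=20))
```

A fixed pixel threshold is exactly the kind of hand-tuned parameter the abstract criticizes in earlier approaches, so this sketch should be read only as an intuition for adjacency-based grouping.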
Taxonomy
Topics: Web Data Mining and Analysis · Video Analysis and Summarization · Web Applications and Data Management
