Examining DOM Coordinate Effectiveness For Page Segmentation
Jason Carpenter, Faaiq Bilal, Eman Ramadan, Zhi-Li Zhang

TL;DR
This study critically evaluates DOM coordinate-based vectors for web page segmentation. It finds that simpler vectors often outperform complex ones, that visual coordinates are less effective than DOM coordinates, and that matching vectors to algorithms improves segmentation accuracy.
Contribution
It provides a detailed analysis of DOM coordinate vectors, challenging existing assumptions and demonstrating how matching vectors and algorithms can significantly enhance segmentation performance.
Findings
Visual coordinates underperform DOM coordinates by about 20-30% on average.
Simple single-coordinate vectors outperform complex vectors in segmentation tasks.
Properly matched vectors and algorithms achieve 74% segmentation accuracy, a 20% improvement.
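To make the idea of a single-coordinate vector concrete, here is a minimal sketch under stated assumptions. It is not the paper's method: the `Node` class, the choice of preorder index as the coordinate, and the gap-based grouping in `segment_by_gap` are all hypothetical, chosen only to show how one DOM-derived number per element can already separate page regions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (tag name plus children)."""
    tag: str
    children: list = field(default_factory=list)


def leaf_coordinates(root):
    """Return (tag, preorder_index, depth) for each leaf, in document order."""
    leaves = []
    counter = 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        index = counter
        counter += 1
        if not node.children:
            leaves.append((node.tag, index, depth))
        # Push children reversed so they are popped in document order.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return leaves


def segment_by_gap(leaves, max_gap=1):
    """Start a new segment whenever consecutive preorder indices differ by more than max_gap."""
    segments = []
    for leaf in leaves:
        if segments and leaf[1] - segments[-1][-1][1] <= max_gap:
            segments[-1].append(leaf)
        else:
            segments.append([leaf])
    return segments


# Toy page: a navigation bar, an article, and a footer.
page = Node("body", [
    Node("nav", [Node("a"), Node("a"), Node("a")]),
    Node("main", [Node("article", [Node("h1"), Node("p"), Node("p")])]),
    Node("footer", [Node("span")]),
])

segments = segment_by_gap(leaf_coordinates(page))
segment_tags = [[tag for tag, _, _ in seg] for seg in segments]
print(segment_tags)
```

On this toy page the leaves fall into three groups (navigation links, article content, footer) using the preorder index alone, because container elements occupy the indices between groups. Real pages are far messier, and which coordinate and clustering algorithm pair well is exactly the question the study examines.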
Abstract
Web pages form a cornerstone of the data available for daily human consumption and, with the rise of LLM-based search and learning systems, a treasure trove of valuable data. The scale of this data and its unstructured format continue to grow, requiring ever more robust automated extraction and retrieval mechanisms. Existing work, leveraging the web page's Document Object Model (DOM), often derives clustering vectors from coordinates informed by the DOM, such as visual placement or tree structure. The construction and component value of these vectors often go unexamined. Our work proposes and examines DOM coordinates in detail to understand their impact on web page segmentation. We find that there is no one-size-fits-all vector, and that visual coordinates under-perform compared to DOM coordinates by about 20-30% on average. This challenges the necessity of including visual…
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Image and Video Retrieval Techniques · Information Retrieval and Search Behavior
