Using the DOM Tree for Content Extraction
Sergio López (Universitat Politècnica de València), Josep Silva (Universitat Politècnica de València), David Insa (Universitat Politècnica de València)

TL;DR
This paper introduces a DOM tree-based technique for extracting main webpage content by analyzing hierarchical relations, achieving high recall and precision, and improving cohesion of extracted content blocks.
Contribution
A novel content extraction method leveraging DOM tree structure to improve accuracy and cohesion of extracted webpage content.
Findings
Achieves high recall and precision in content extraction.
Produces very cohesive content blocks.
Utilizes DOM hierarchy for precise component relation analysis.
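As a rough illustration of what analyzing the DOM hierarchy can look like, the Python sketch below parses a page into a tree and selects the subtree that holds the most text relative to its markup. This summary does not state the paper's actual scoring, so the score used here (text length multiplied by text per node) and all names (`Node`, `TreeBuilder`, `extract_main_content`, `subtree_text`) are assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    """Hypothetical minimal DOM element: children are Node objects or text strings."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.chars = 0   # text characters in this subtree (set by measure)
        self.count = 1   # element nodes in this subtree (set by measure)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML text."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIP:
            self.current.children.append(text)


def measure(node):
    """Store and return (text characters, element count) for a subtree."""
    chars, count = 0, 1
    for child in node.children:
        if isinstance(child, str):
            chars += len(child)
        else:
            c, n = measure(child)
            chars += c
            count += n
    node.chars, node.count = chars, count
    return chars, count


def all_nodes(node):
    """Yield every element in the subtree, parents before children."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from all_nodes(child)


def subtree_text(node):
    """Concatenate the text of a subtree in document order."""
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_main_content(html):
    """Return the element with the highest score, or None for an empty page.

    Assumed score: chars * (chars / count), i.e. text volume weighted by
    text density. This is an illustrative choice, not the paper's formula.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    measure(builder.root)
    candidates = [n for n in all_nodes(builder.root) if n is not builder.root]
    return max(candidates, key=lambda n: n.chars * n.chars / n.count, default=None)
```

Weighting density by volume is a deliberate simplification: pure density would favor a single long paragraph, while pure volume would favor the whole `body`. Calling `subtree_text(extract_main_content(page))` returns the text of the selected block.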
Abstract
The main information of a webpage is usually mixed with menus, advertisements, panels, and other not necessarily related information, and it is often difficult to isolate it automatically. This is precisely the objective of content extraction, a research area of wide interest due to its many applications. Content extraction is useful not only for the final human user; it is also frequently used as a preprocessing stage by systems that need to extract the main content of a web document to avoid processing other useless information. Another interesting application of content extraction is displaying webpages on small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of the elements in the…
Taxonomy
Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Algorithms and Data Compression
