Effective Blog Pages Extractor for Better UGC Accessing
Kui Zhao, Yi Wang, Xia Hu, Can Wang

TL;DR
This paper introduces a template-independent method for extracting main content from blog pages by converting pages into DOM-trees, extracting features, and using SVM classifiers, improving robustness and adaptability.
Contribution
The paper presents a novel, template-independent blog content extractor that uses DOM-tree analysis and machine learning, avoiding costly template development.
Findings
Effective extraction across diverse blog styles
High accuracy verified on 2,250 blog pages
Robust to template updates and variations
Abstract
Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves user experience but also better adapts the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they can be expensive, in that a large number of templates need to be developed, and they fail once a template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM-Tree, where all elements including the title and body blocks in a page correspond to subtrees. Then we construct a subtree candidate set for the title and the body blocks respectively, and extract both spatial and content features for…
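The first steps described in the abstract (page to DOM-tree, subtree candidates, per-candidate features) can be illustrated with a minimal sketch using only the Python standard library. This is not the authors' implementation: the names `Node`, `DomBuilder`, `candidates`, and `content_features` are hypothetical, and the two content features shown (text length and link density) are assumed examples, since the abstract does not list the actual features. Spatial features, which would need rendered layout information, and the SVM classification step are left out.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the DOM-tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text chunks directly inside this element


class DomBuilder(HTMLParser):
    """Builds a simple DOM-tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag;
        # stray end tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def candidates(node, depth=0):
    """Yield every subtree root with its depth, in document order."""
    for child in node.children:
        yield child, depth + 1
        yield from candidates(child, depth + 1)


def text_len(node, links_only=False, inside_link=False):
    """Characters of text in the subtree, optionally only text inside <a>."""
    inside = inside_link or node.tag == "a"
    own = sum(len(t) for t in node.text) if (inside or not links_only) else 0
    return own + sum(text_len(c, links_only, inside) for c in node.children)


def content_features(node, depth):
    """Assumed example features; the paper's real feature set may differ."""
    total = text_len(node)
    linked = text_len(node, links_only=True)
    return {
        "tag": node.tag,
        "depth": depth,
        "text_len": total,
        "link_density": linked / total if total else 0.0,
    }
```

Calling `[content_features(n, d) for n, d in candidates(parse(html))]` on a page yields one feature dictionary per subtree. In a full pipeline, feature vectors like these would be passed to a trained classifier that selects the title and body subtrees. Link density is a commonly used signal because navigation and advertisement blocks tend to consist mostly of link text.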
