Effective Blog Pages Extractor for Better UGC Accessing

Kui Zhao; Yi Wang; Xia Hu; Can Wang

arXiv:1708.07935·cs.IR·August 29, 2017

TL;DR

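This paper introduces a template-independent method for extracting the main content from blog pages: each page is converted into a DOM tree, features are extracted from candidate subtrees, and SVM classifiers select the title and body blocks. Because no per-site template is required, the extractor keeps working when a blog's template changes.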

Contribution

The paper presents a novel, template-independent blog content extractor that uses DOM-tree analysis and machine learning, avoiding costly template development.

Findings

1. Effective extraction across diverse blog styles
2. High accuracy verified on 2,250 blog pages
3. Robust to template updates and variations

Abstract

Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves the user experience but also helps adapt the content to various devices such as mobile phones. Though template-based extractors are highly accurate, they can be expensive: a large number of templates need to be developed, and an extractor fails once its template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM-Tree, where all elements, including the title and body blocks in a page, correspond to subtrees. Then we construct a subtree candidate set for the title and the body blocks respectively, and extract both spatial and content features for…
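The first steps described in the abstract (building a DOM tree and collecting candidate subtrees with features) can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the authors' implementation. The names `Node`, `TreeBuilder`, and `candidate_features`, the choice of block tags, and the specific content features (text length, link density, depth) are all assumptions made for the example; the truncated abstract does not list the paper's actual features. Spatial features would need a rendering engine and are left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
SKIP_TEXT = {"script", "style"}


class Node:
    """One element of the DOM tree (hypothetical helper for this sketch)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text that sits directly inside this element
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TEXT:
            self.current.text += data


def subtree_text(node):
    """All text under a node, whitespace-normalized (order is not preserved)."""
    parts = [node.text] + [subtree_text(child) for child in node.children]
    return " ".join(" ".join(parts).split())


def link_text_length(node):
    """Number of characters of text that sit inside <a> elements."""
    if node.tag == "a":
        return len(subtree_text(node))
    return sum(link_text_length(child) for child in node.children)


def candidate_features(html, block_tags=("div", "article", "section", "td")):
    """Return one feature dict per non-empty block-level subtree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []

    def visit(node):
        if node.tag in block_tags:
            text_len = len(subtree_text(node))
            if text_len > 0:
                rows.append({
                    "tag": node.tag,
                    "id": node.attrs.get("id"),
                    "depth": node.depth,
                    "text_len": text_len,
                    "link_density": link_text_length(node) / text_len,
                })
        for child in node.children:
            visit(child)

    visit(builder.root)
    return rows
```

Calling `candidate_features(page_html)` yields one feature dictionary per candidate subtree. In a pipeline like the one the paper describes, vectors of this kind would then be passed to a trained classifier (the paper uses SVMs) to pick the title and body blocks; that training step is not shown here.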
