A baseline for content-based blog classification

Olof Gornerup; Magnus Boman

arXiv:0909.4416·cs.IR·September 25, 2009

TL;DR

This paper introduces a simple, computationally efficient content-based method for classifying blogs with a word-overlap similarity measure, and shows on Swedish blog data that blogs treating similar subjects cluster together.

Contribution

It presents a basic, transparent network representation intended as a baseline for content-based blog classification, and shows that the resulting topic clusters are themselves organized into higher-order clusters.

Findings

01. Blogs on similar topics form distinct clusters

02. Clusters are hierarchically organized into higher-order groups

03. The method is computationally simple and transparent

Abstract

A content-based network representation of web logs (blogs) using a basic word-overlap similarity measure is presented. Due to a strong signal in blog data the approach is sufficient for accurately classifying blogs. Using Swedish blog data we demonstrate that blogs that treat similar subjects are organized in clusters that, in turn, are hierarchically organized in higher-order clusters. The simplicity of the representation renders it both computationally tractable and transparent. We therefore argue that the approach is suitable as a baseline when developing and analyzing more advanced content-based representations of the blogosphere.
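The abstract does not give the exact similarity measure, so the following is only a minimal sketch of what a word-overlap network could look like. It assumes Jaccard overlap between the word sets of two blogs and a fixed edge threshold; the function names (`tokenize`, `overlap`, `similarity_network`), the threshold value, and the toy texts are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return {w for w in words if w}


def overlap(a, b):
    """Jaccard overlap of two word sets (an assumed stand-in measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_network(blogs, threshold=0.2):
    """Return weighted edges {(name1, name2): score} with score >= threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    words = {name: tokenize(text) for name, text in blogs.items()}
    edges = {}
    for u, v in combinations(sorted(words), 2):
        score = overlap(words[u], words[v])
        if score >= threshold:
            edges[(u, v)] = score
    return edges


# Toy data: two blogs on the same subject and one on a different subject.
blogs = {
    "football_a": "the match ended with a late goal",
    "football_b": "a late goal decided the match",
    "cooking": "bake the bread with fresh yeast",
}
edges = similarity_network(blogs)
print(edges)
```

On this toy input only the two football blogs end up connected, which mirrors the kind of topical grouping the abstract describes. The resulting weighted edge list could then be passed to any graph clustering routine to look for higher-order groups.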

Taxonomy

Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Advanced Text Analysis Techniques