A baseline for content-based blog classification
Olof Gornerup, Magnus Boman

TL;DR
This paper introduces a simple, computationally efficient content-based method for classifying blogs using a word-overlap similarity measure, and shows on Swedish blog data that it groups blogs into topical clusters.
Contribution
It presents a basic, transparent word-overlap representation as a baseline for content-based blog classification, and shows that the resulting topical clusters are themselves organized hierarchically.
Findings
Blogs on similar topics form distinct clusters
Clusters are hierarchically organized into higher-order groups
The method is computationally simple and transparent
Abstract
A content-based network representation of web logs (blogs) using a basic word-overlap similarity measure is presented. Due to a strong signal in blog data, the approach is sufficient for accurately classifying blogs. Using Swedish blog data, we demonstrate that blogs that treat similar subjects are organized in clusters that, in turn, are hierarchically organized in higher-order clusters. The simplicity of the representation renders it both computationally tractable and transparent. We therefore argue that the approach is suitable as a baseline when developing and analyzing more advanced content-based representations of the blogosphere.
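As a rough illustration of the idea in the abstract, the sketch below builds a small word-overlap network between blogs. It is not the authors' implementation: the abstract does not specify the exact similarity measure, so Jaccard overlap of word sets is assumed here, and the function names, the threshold value, and the toy English data are all hypothetical.

```python
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()[]") for w in text.lower().split())
    return {w for w in words if w}


def jaccard(a, b):
    """Word-overlap similarity (assumed measure): |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_network(blogs, threshold=0.2):
    """Return weighted edges between blogs whose overlap meets the threshold.

    `blogs` maps a blog name to its text; `threshold` is an illustrative
    cut-off, not a value taken from the paper.
    """
    vocab = {name: tokenize(text) for name, text in blogs.items()}
    edges = {}
    for u, v in combinations(sorted(vocab), 2):
        score = jaccard(vocab[u], vocab[v])
        if score >= threshold:
            edges[(u, v)] = score
    return edges


# Toy data: two food blogs and one tech blog.
blogs = {
    "food_a": "recipes for bread and cake",
    "food_b": "bread recipes and cake baking",
    "tech_a": "python code for web servers",
}
edges = similarity_network(blogs)
```

In this toy example only the two food blogs share enough words to be linked, which is the kind of topical grouping the abstract describes. Clusters could then be read off the network with any standard community-detection or hierarchical clustering step, a choice left open here.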
Taxonomy
Topics: Web Data Mining and Analysis · Text and Document Classification Technologies · Advanced Text Analysis Techniques
