Content-driven, unsupervised clustering of news articles through multiscale graph partitioning
M. Tarik Altuncu, Sophia N. Yaliraki, Mauricio Barahona

TL;DR
This paper introduces an unsupervised, multiscale graph partitioning approach combining NLP embeddings and graph theory to cluster news articles by content, revealing hierarchical topic structures without prior assumptions.
Contribution
It presents a novel framework integrating deep neural text embeddings with multiscale community detection for content-driven news clustering.
Findings
The resulting clusters align with meaningful content groups
The method reveals hierarchical topic structures
The method outperforms standard topic detection methods
Abstract
The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven groupings of articles. We present here such a methodology that brings together powerful vector embeddings from Natural Language Processing with tools from Graph Theory that exploit diffusive dynamics on graphs to reveal natural partitions across scales. Our framework uses a recent deep neural network text analysis methodology (Doc2vec) to represent text in vector form and then applies a multi-scale community detection method (Markov…
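The pipeline the abstract outlines (document vectors, then a similarity graph, then a diffusion-based search for partitions across scales) can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' implementation: the hand-made 2-D vectors merely stand in for Doc2vec embeddings, the k-nearest-neighbour graph construction and the function names `knn_graph` and `walk_autocovariance` are hypothetical, and the quality function (the autocovariance of a random walk after `t` steps) is one common diffusion-based score that may or may not match the method named in the truncated abstract. The sketch only scores given partitions; it does not optimise over them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_graph(vectors, k):
    """Symmetric k-nearest-neighbour graph weighted by non-negative cosine similarity."""
    n = len(vectors)
    sim = [[0.0 if i == j else max(cosine(vectors[i], vectors[j]), 0.0)
            for j in range(n)] for i in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Keep an edge if either endpoint lists the other among its k nearest.
        for j in sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]:
            A[i][j] = A[j][i] = sim[i][j]
    return A

def matmul(X, Y):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def walk_autocovariance(A, labels, t):
    """Score a partition by how well a t-step random walk stays inside its groups.

    Assumes every node has at least one positive-weight edge.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    total = sum(degree)
    pi = [d / total for d in degree]  # stationary distribution on an undirected graph
    M = [[A[i][j] / degree[i] for j in range(n)] for i in range(n)]
    Mt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        Mt = matmul(Mt, M)
    # Probability of starting and ending in the same group, minus the chance level.
    return sum(pi[i] * Mt[i][j] - pi[i] * pi[j]
               for i in range(n) for j in range(n) if labels[i] == labels[j])

# Toy stand-ins for document embeddings: two loose groups of three "articles".
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
           [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
A = knn_graph(vectors, k=3)

true_labels = [0, 0, 0, 1, 1, 1]   # groups that follow the vectors
mixed_labels = [0, 1, 0, 1, 0, 1]  # groups that cut across them

for t in (1, 2, 4):
    print(t, walk_autocovariance(A, true_labels, t),
          walk_autocovariance(A, mixed_labels, t))
```

The step count `t` plays the role of a scale parameter: a longer walk has more time to leak out of small groups, so longer times tend to favour coarser partitions. Comparing scores across several values of `t` is the intuition behind looking for partitions "across scales".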
Taxonomy
Topics: Complex Network Analysis Techniques · Advanced Text Analysis Techniques · Text and Document Classification Technologies
