Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles
M. Tarik Altuncu, Sophia N. Yaliraki, Mauricio Barahona

TL;DR
This paper introduces an unsupervised graph-based framework that combines NLP vector embeddings with multiscale graph partitioning to extract topics from large text corpora, demonstrated on 2016 US news articles.
Contribution
It presents a novel unsupervised method integrating vector embeddings and graph partitioning for multi-resolution topic detection without predefined cluster numbers.
Findings
Graph-based clustering outperforms traditional methods in topic coherence.
Transformer-based embeddings like BERT improve topic detection quality.
Multiscale analysis reveals hierarchical content structures.
Abstract
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into 'topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers…
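To make the pipeline described in the abstract concrete, here is a minimal sketch (not the authors' implementation) of the general idea: embed documents as vectors, build a similarity graph, and read off partitions at several resolutions without fixing the number of clusters in advance. It assumes Bag-of-Words counts as the embedding, cosine similarity as the edge weight, and connected components of a thresholded graph as a deliberately simple stand-in for the multiscale graph partitioning used in the paper. The toy documents, the thresholds, and the function names `bow_vector`, `cosine`, `similarity_graph`, and `partition` are all hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-Words term counts for one document (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(docs):
    """Weighted edges {(i, j): similarity} over all document pairs."""
    vecs = [bow_vector(d) for d in docs]
    n = len(vecs)
    return {(i, j): cosine(vecs[i], vecs[j])
            for i in range(n) for j in range(i + 1, n)}


def partition(n, edges, threshold):
    """Connected components after dropping edges below `threshold`.

    Assumption: this thresholding is a simple stand-in for a real
    multiscale community detection method, not the paper's algorithm.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), weight in edges.items():
        if weight >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy corpus: two politics items and two sports items,
# weakly linked by the shared word "election".
docs = [
    "senate votes on election bill",
    "senate passes election bill",
    "team wins championship game",
    "team loses championship game after election",
]

edges = similarity_graph(docs)

# A low threshold gives a coarse partition, a high one a fine partition.
levels = {t: partition(len(docs), edges, t) for t in (0.1, 0.5, 0.9)}
for threshold, clusters in levels.items():
    print(threshold, clusters)
```

Sweeping the threshold plays the role of the resolution parameter: the number of clusters is an output of each level rather than an input. In practice the Bag-of-Words vectors would be replaced by Doc2Vec or transformer embeddings, and the thresholding step by a proper multiscale community detection method.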
