Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
Hans W. A. Hanley, Zakir Durumeric

TL;DR
This paper introduces a scalable, interpretable, and multilingual hierarchical clustering method for news articles using novel Matryoshka embeddings that operate at multiple levels of granularity.
Contribution
The paper proposes a new multilingual Matryoshka embedding model and a hierarchical clustering algorithm that together improve scalability, interpretability, and performance in multilingual news clustering.
Findings
Achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson ρ = 0.816).
Identifies stories, narratives, and themes in real-world datasets.
Provides a scalable and interpretable approach to multilingual news clustering.
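The headline result is a Pearson correlation. As a minimal sketch of how such a score is computed, assuming the evaluation compares a model's predicted similarity for each article pair against a human rating, the function below implements the standard sample Pearson coefficient. The names `pearson`, `predicted`, and `human` and the toy numbers are illustrative only and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation terms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Hypothetical toy data: model similarity scores vs. human ratings.
predicted = [0.10, 0.40, 0.55, 0.90]
human = [1.0, 2.0, 3.0, 4.0]
score = pearson(predicted, human)
```

A score near 1 means the predicted similarities rise and fall with the human ratings; the coefficient ignores the absolute scale of either sequence.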
Abstract
Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson ρ = 0.816). Once the model is trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We…
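The idea of reading similarity at different granularities from different subsets of embedding dimensions can be illustrated with a small sketch. It is not the authors' algorithm: it assumes that shorter prefixes of a vector encode coarser similarity, and it swaps in a simple greedy threshold clustering for the paper's method. The function names, dimension cut-offs, thresholds, and toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prefix_similarity(u, v, dims):
    """Similarity using only the first `dims` dimensions of each vector."""
    return cosine(u[:dims], v[:dims])


def greedy_cluster(indices, vectors, dims, threshold):
    """Put each item in the first cluster whose seed item is similar enough."""
    clusters = []
    for i in indices:
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if prefix_similarity(vectors[i], seed, dims) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def level_wise_cluster(vectors, levels):
    """Cluster coarse-to-fine; `levels` is a list of (dims, threshold) pairs."""
    def recurse(indices, depth):
        if depth == len(levels):
            return indices
        dims, threshold = levels[depth]
        groups = greedy_cluster(indices, vectors, dims, threshold)
        return [recurse(group, depth + 1) for group in groups]
    return recurse(list(range(len(vectors))), 0)


# Hypothetical 4-dimensional "embeddings": items 0-2 agree on the first two
# dimensions (same coarse group), but item 2 diverges in the full vector.
vectors = [
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.9, 0.1],
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
result = level_wise_cluster(vectors, levels=[(2, 0.9), (4, 0.9)])
```

Here `result` is a nested list of item indices: the outer lists are coarse groups found from the 2-dimensional prefix, and the inner lists are finer groups found from all 4 dimensions. Because each finer level only compares items inside one coarser group, the number of pairwise comparisons shrinks as the hierarchy deepens.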
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Complex Network Analysis Techniques
