An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets

Vijay Kumar Sutrakar; Nikhil Mogre

arXiv:2502.16139·cs.LG·May 22, 2025

TL;DR

This paper introduces an enhanced clustering method for large text datasets that leverages fine-tuned word embeddings, advanced dimensionality reduction, and optimized algorithms to significantly improve clustering quality.

Contribution

The paper presents an improved clustering technique built on the WEClustering base model, incorporating fine-tuned contextual embeddings and optimization strategies for better clustering of large-scale text data.

Findings

01

45% increase in median silhouette score for WEClustering_K++

02

67% increase in median silhouette score for WEClustering_A++

03

Significant improvements in clustering metrics like purity and ARI
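
The percentage figures above are increases in a median score. As a minimal sketch of that arithmetic, assuming the increase is measured relative to the baseline median (the summary does not spell out the exact procedure), the helper below computes it. The function name `relative_median_increase` and the score lists are hypothetical and are not values from the paper.

```python
from statistics import median


def relative_median_increase(baseline, improved):
    """Percent change of median(improved) relative to median(baseline)."""
    base_med = median(baseline)
    if base_med == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return 100.0 * (median(improved) - base_med) / abs(base_med)


# Hypothetical silhouette scores, one per dataset (NOT taken from the paper).
baseline_scores = [0.20, 0.30, 0.40]
improved_scores = [0.35, 0.45, 0.50]

print(f"{relative_median_increase(baseline_scores, improved_scores):.1f}% increase")
```

Dividing by the absolute value of the baseline median keeps the sign of the change meaningful when a silhouette score, which ranges from -1 to 1, is negative.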

Abstract

In this paper, an improved clustering technique for large textual datasets that leverages fine-tuned word embeddings is presented. The WEClustering technique is used as the base model. The WEClustering model is further improved by incorporating fine-tuned contextual embeddings, advanced dimensionality reduction methods, and optimized clustering algorithms. Experimental results on benchmark datasets demonstrate significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). Increases of 45% and 67% in median silhouette score are reported for the proposed WEClustering_K++ (based on K-means) and WEClustering_A++ (based on agglomerative clustering) models, respectively. The proposed technique will help to bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
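
Two of the metrics named in the abstract, purity and ARI, compare predicted clusters against ground-truth labels. The sketch below implements their standard textbook definitions in plain Python. It is an illustration, not the authors' evaluation code, and the function names and toy labels are assumptions made for the example.

```python
from collections import Counter
from math import comb


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare
    # Pairs of items that share both a true class and a predicted cluster.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a predicted cluster.
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Toy labels for illustration only (NOT data from the paper).
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
print("purity:", purity(truth, predicted))
print("ARI:", adjusted_rand_index(truth, predicted))
```

Purity rewards clusters dominated by one class but tends to rise as the number of clusters grows, whereas ARI corrects for chance agreement, which is one reason the two are commonly reported together.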

Taxonomy

Methods: Balanced Selection