An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
Vijay Kumar Sutrakar, Nikhil Mogre

TL;DR
This paper introduces an enhanced clustering method for large text datasets that leverages fine-tuned word embeddings, advanced dimensionality reduction, and optimized algorithms to significantly improve clustering quality.
Contribution
The paper presents WEClustering_K++ and WEClustering_A++, improved variants of the WEClustering technique that incorporate fine-tuned contextual embeddings, advanced dimensionality reduction, and optimized clustering algorithms for large-scale text clustering.
Findings
45% increase in median silhouette score for WEClustering_K++
67% increase in median silhouette score for WEClustering_A++
Significant improvements in clustering metrics like purity and ARI
Abstract
This paper presents an improved clustering technique for large textual datasets that leverages fine-tuned word embeddings. The WEClustering technique is used as the base model. It is further improved by incorporating fine-tuned contextual embeddings, advanced dimensionality reduction methods, and optimized clustering algorithms. Experimental results on benchmark datasets demonstrate significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). Increases of 45% and 67% in median silhouette score are reported for the proposed WEClustering_K++ (based on K-means) and WEClustering_A++ (based on agglomerative clustering) models, respectively. The proposed technique will help bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
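The abstract describes a pipeline of embeddings, dimensionality reduction, and clustering, scored with silhouette, purity, and ARI. The following is a minimal sketch of that general shape using scikit-learn. It is not the authors' implementation. The function names `purity` and `cluster_and_score`, the choice of PCA, Ward linkage, and `init="k-means++"`, and the synthetic `make_blobs` data standing in for document embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix


def purity(y_true, y_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    table = contingency_matrix(y_true, y_pred)  # rows: true classes, cols: clusters
    return table.max(axis=0).sum() / table.sum()


def cluster_and_score(embeddings, y_true, n_clusters, n_components=10, seed=0):
    """Reduce embeddings, cluster with two algorithms, and report three metrics.

    Hypothetical helper: PCA and Ward linkage are illustrative choices,
    not the methods confirmed by the paper.
    """
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, init="k-means++",
                         n_init=10, random_state=seed),
        "agglomerative": AgglomerativeClustering(n_clusters=n_clusters,
                                                 linkage="ward"),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(reduced)
        results[name] = {
            "silhouette": silhouette_score(reduced, labels),   # needs no ground truth
            "purity": purity(y_true, labels),                  # needs true labels
            "ari": adjusted_rand_score(y_true, labels),        # needs true labels
        }
    return results


# Synthetic vectors stand in for document embeddings (assumption for the demo).
X, y = make_blobs(n_samples=300, n_features=32, centers=3,
                  cluster_std=1.0, random_state=0)
scores = cluster_and_score(X, y, n_clusters=3)
for name, metrics in scores.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Silhouette score is computed from the reduced vectors alone, while purity and ARI compare cluster assignments against reference labels, so the latter two are only available on labeled benchmark datasets. In practice the `embeddings` argument would be a matrix of document vectors produced by an embedding model.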
Taxonomy
Methods: Balanced Selection
