SDEC: Semantic Deep Embedded Clustering
Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri

TL;DR
SDEC is an unsupervised text clustering framework that combines an autoencoder with transformer embeddings to improve semantic preservation and clustering accuracy on large, complex textual datasets.
Contribution
It introduces a novel deep embedded clustering method that integrates semantic-aware autoencoders with transformer-based embeddings and a refinement stage for enhanced text clustering.
Findings
Achieved 85.7% clustering accuracy on the AG News dataset.
Set a new benchmark of 53.63% on Yahoo! Answers.
Demonstrated robust performance across five benchmark datasets.
Abstract
The high-dimensional and semantically complex nature of textual big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, SDEC uses a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings to further improve a clustering layer with soft cluster assignments and a distributional loss. The capabilities of SDEC are demonstrated by…
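The two loss components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The weighting factor `alpha`, the function names, and the choice of a Student's t kernel with a KL-divergence objective (the formulation used in the original Deep Embedded Clustering work) are all assumptions made here for illustration; the abstract only states that MSE and CSL are combined and that the clustering layer uses soft assignments with a distributional loss.

```python
import numpy as np

def combined_reconstruction_loss(x, x_hat, alpha=0.5, eps=1e-8):
    """Weighted sum of MSE and cosine-similarity loss.

    alpha is a hypothetical mixing weight; the abstract does not say
    how the two terms are balanced.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    mse = np.mean((x - x_hat) ** 2)
    # Row-wise cosine similarity between inputs and reconstructions.
    dot = np.sum(x * x_hat, axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)
    csl = np.mean(1.0 - dot / (norms + eps))
    return alpha * mse + (1.0 - alpha) * csl

def soft_assignments(z, centroids, dof=1.0):
    """Soft cluster assignments via a Student's t kernel (assumed, DEC-style)."""
    d2 = np.sum((z[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    q = (1.0 + d2 / dof) ** (-(dof + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets: square q, normalize by cluster frequency, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(q, eps=1e-12):
    """KL(P || Q) between the sharpened targets P and the soft assignments Q."""
    p = target_distribution(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In a setup like this, `x` would hold the transformer embeddings of the documents, `x_hat` the autoencoder's reconstruction of them, and `z` the latent codes passed to the clustering layer. The cosine term penalizes changes in direction that MSE alone can underweight, which is one plausible reading of how the combined loss preserves semantic relationships.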
