Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa -- A Large Romanian Sentiment Data Set
Anca Maria Tache, Mihaela Gaman, Radu Tudor Ionescu

TL;DR
This paper introduces LaRoSeDa, a large Romanian sentiment dataset, and explores clustering word embeddings with self-organizing maps, showing improved results over k-means and demonstrating generalization to text categorization tasks.
Contribution
The paper presents a new Romanian sentiment data set and replaces k-means with self-organizing maps (SOMs) for clustering word embeddings, which gives better results and generalizes to text categorization.
Findings
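SOMs produce clusters of word embeddings that are closer to the Zipf's law distribution than those produced by k-means.
SOM-based clustering improves sentiment classification accuracy over the k-means-based bag-of-word-embeddings baseline.
The method generalizes well to Romanian text categorization.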
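As a rough illustration of the clustering step behind these findings, the following is a minimal self-organizing map written in plain Python. It is a sketch under stated assumptions, not the authors' implementation: the grid size, learning rate, linear decay schedule, Gaussian neighborhood, and the toy 2-D vectors are all hypothetical choices made for this example, and real word embeddings would have many more dimensions.

```python
import math
import random

def best_matching_unit(units, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(units)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(units[k][2], x)))

def train_som(vectors, rows, cols, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM; returns a list of (row, col, weight) units.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Initialise every unit with a copy of a randomly chosen input vector.
    units = [(r, c, list(rng.choice(vectors)))
             for r in range(rows) for c in range(cols)]
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighborhood radius shrink linearly over time.
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 1e-3)
            br, bc, _ = units[best_matching_unit(units, x)]
            for r, c, w in units:
                # Units near the winner on the grid are pulled toward x as well.
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return units

def ranked_cluster_sizes(units, vectors):
    """Cluster sizes sorted from largest to smallest (a rank-size profile)."""
    counts = [0] * len(units)
    for x in vectors:
        counts[best_matching_unit(units, x)] += 1
    return sorted(counts, reverse=True)

# Hypothetical toy "embeddings"; each unit of the 2 x 2 grid acts as one cluster.
toy_vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
               [1.0, 0.9], [0.9, 1.0], [5.0, 5.0]]
units = train_som(toy_vectors, rows=2, cols=2)
sizes = ranked_cluster_sizes(units, toy_vectors)
```

The ranked sizes can then be compared against a Zipf-like profile, in which the cluster at rank r has a size roughly proportional to 1/r. The same profile computed for k-means clusters would give the comparison the findings refer to.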
Abstract
Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from one of the largest Romanian e-commerce platforms. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf's law distribution, which is known to govern natural language. We also…
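The bag-of-word-embeddings representation mentioned in the abstract can be sketched as follows, assuming the word embeddings have already been clustered (with k-means or a SOM) so that every vocabulary word maps to a cluster index. The function name, the example words, their cluster assignments, and the choice to normalize the histogram are hypothetical and only serve the illustration.

```python
from collections import Counter

def bag_of_word_embeddings(tokens, word_to_cluster, num_clusters):
    """Represent a document as a normalized histogram over embedding clusters.

    Words without a cluster assignment are skipped. Normalizing by the number
    of matched tokens is an assumption made for this sketch.
    """
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    total = sum(counts.values())
    histogram = [0.0] * num_clusters
    for cluster_id, n in counts.items():
        histogram[cluster_id] = n / total
    return histogram

# Hypothetical cluster assignments for a few Romanian words.
word_to_cluster = {
    "bun": 0, "excelent": 0,   # positive words
    "prost": 1, "slab": 1,     # negative words
    "produs": 2,               # neutral word
}
features = bag_of_word_embeddings(["produs", "bun", "excelent"], word_to_cluster, 3)
```

The resulting fixed-length vector has one entry per cluster and can be passed to any standard classifier, which is what makes the quality of the clustering matter for the final result.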
Taxonomy
Methods: k-Means Clustering
