Improving Quality of Hierarchical Clustering for Large Data Series
Manuel R. Ciosici

TL;DR
This paper analyzes the weaknesses of Brown clustering and proposes two modifications that improve the quality of hierarchical clustering for large data series, supported by a thorough evaluation.
Contribution
It provides a detailed analysis of Brown clustering's weaknesses and introduces two novel modifications to enhance cluster quality and reproducibility.
Findings
Identified key weaknesses in the original Brown clustering algorithm.
Proposed two modifications that improve cluster quality.
Thorough evaluation shows enhanced performance of the modified algorithm.
Abstract
Brown clustering is a hard, hierarchical, bottom-up clustering of words in a vocabulary. Words are assigned to clusters based on their usage pattern in a given corpus. The resulting clusters and hierarchical structure can be used in constructing class-based language models and for generating features to be used in NLP tasks. Because of its high computational cost, the most-used version of Brown clustering is a greedy algorithm that uses a window to restrict its search space. Like other clustering algorithms, Brown clustering finds a sub-optimal, but nonetheless effective, mapping of words to clusters. Because of its ability to produce high-quality, human-understandable clusters, Brown clustering has seen high uptake in the NLP research community, where it is used in the preprocessing and feature generation steps. Little research has been done towards improving the quality of Brown…
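The windowed greedy procedure the abstract describes can be illustrated with a minimal Python sketch. It is an assumption-laden illustration of the classic algorithm, not the paper's implementation and not its two proposed modifications: the function names (`average_mutual_information`, `merge_best_pair`, `windowed_brown_clustering`) are hypothetical, the merge score is recomputed from scratch rather than updated incrementally, bigrams that contain a not-yet-clustered word are simply skipped, and ties are broken by iteration order. The sketch stops once every word belongs to one of `num_clusters` clusters; the remaining merges that would complete the hierarchy are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assign):
    """AMI between the cluster labels of adjacent words.

    Bigrams containing a word that has no cluster yet are skipped
    (a simplification made for this sketch).
    """
    pair, left, right = Counter(), Counter(), Counter()
    total = 0
    for (w1, w2), n in bigrams.items():
        if w1 in assign and w2 in assign:
            c1, c2 = assign[w1], assign[w2]
            pair[(c1, c2)] += n
            left[c1] += n
            right[c2] += n
            total += n
    if total == 0:
        return 0.0
    # Sum over cluster pairs of p(c1, c2) * log2(p(c1, c2) / (p(c1) * p(c2))).
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def merge_best_pair(bigrams, assign):
    """Merge the two clusters whose union leaves AMI highest.

    Updates `assign` in place and returns (kept_id, absorbed_id).
    """
    best = None
    for a, b in combinations(sorted(set(assign.values())), 2):
        trial = {w: (a if c == b else c) for w, c in assign.items()}
        score = average_mutual_information(bigrams, trial)
        if best is None or score > best[0]:
            best = (score, a, b)
    _, a, b = best
    for w, c in assign.items():
        if c == b:
            assign[w] = a  # hard clustering: each word has exactly one cluster
    return a, b


def windowed_brown_clustering(tokens, num_clusters):
    """Greedy bottom-up clustering with a window of `num_clusters` clusters."""
    freq = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Most frequent words first; alphabetical order breaks frequency ties.
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    assign, merges = {}, []
    for cluster_id, word in enumerate(vocab):
        assign[word] = cluster_id  # each new word starts in its own cluster
        if len(set(assign.values())) > num_clusters:
            merges.append(merge_best_pair(bigrams, assign))
    return assign, merges


tokens = "the cat sat the dog sat the cat ran the dog ran".split()
clusters, merges = windowed_brown_clustering(tokens, num_clusters=3)
print(clusters)
print(merges)
```

The window here is the `num_clusters` limit: only that many clusters (plus the one newly added word) are ever compared, which is what keeps the search space small at the cost of a sub-optimal result.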
Taxonomy
Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Data Mining Algorithms and Applications
