Improving Quality of Hierarchical Clustering for Large Data Series

Manuel R. Ciosici

arXiv:1608.01238·cs.CL·August 5, 2016·2 cites

TL;DR

This paper analyzes the weaknesses of Brown clustering and proposes two modifications that improve the quality of hierarchical clustering for large data series, followed by a thorough evaluation of the results.

Contribution

It provides a detailed analysis of Brown clustering's weaknesses and introduces two novel modifications to enhance cluster quality and reproducibility.

Findings

01. Identified key weaknesses in the original Brown clustering algorithm.

02. Proposed two modifications that improve cluster quality.

03. Thorough evaluation shows enhanced performance of the modified algorithm.

Abstract

Brown clustering is a hard, hierarchical, bottom-up clustering of words in a vocabulary. Words are assigned to clusters based on their usage pattern in a given corpus. The resulting clusters and hierarchical structure can be used in constructing class-based language models and for generating features to be used in NLP tasks. Because of its high computational cost, the most-used version of Brown clustering is a greedy algorithm that uses a window to restrict its search space. Like other clustering algorithms, Brown clustering finds a sub-optimal, but nonetheless effective, mapping of words to clusters. Because of its ability to produce high-quality, human-understandable clusters, Brown clustering has seen high uptake in the NLP research community, where it is used in the preprocessing and feature generation steps. Little research has been done towards improving the quality of Brown…
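
The greedy, windowed procedure the abstract describes can be illustrated with a minimal Python sketch. It is not the author's implementation and does not include the paper's two proposed modifications. It assumes the usual objective for Brown clustering (average mutual information between adjacent clusters), recomputes that objective from scratch for every candidate merge, and ignores words that have not yet entered the window. The function names `average_mutual_information`, `merge_best_pair`, and `windowed_brown`, and the `window` parameter, are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assignment):
    """Average mutual information (in bits) between adjacent clusters."""
    pair, left, right, total = Counter(), Counter(), Counter(), 0
    for (w1, w2), n in bigrams.items():
        if w1 not in assignment or w2 not in assignment:
            continue  # simplifying assumption: skip words not yet clustered
        c1, c2 = assignment[w1], assignment[w2]
        pair[c1, c2] += n
        left[c1] += n
        right[c2] += n
        total += n
    # Sum of p(c1, c2) * log2(p(c1, c2) / (p(c1) * p(c2))) over seen pairs.
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pair.items()
    )


def merge_best_pair(active, assignment, bigrams):
    """Merge the two active clusters whose union keeps the objective highest."""
    best = None
    for a, b in combinations(active, 2):
        # Hard clustering: every word of cluster b moves to cluster a.
        trial = {w: (a if c == b else c) for w, c in assignment.items()}
        score = average_mutual_information(bigrams, trial)
        if best is None or score > best[0]:  # ties: the first pair found wins
            best = (score, a, b, trial)
    _, a, b, trial = best
    assignment.clear()
    assignment.update(trial)
    active.remove(b)
    return a, b


def windowed_brown(tokens, window=3):
    """Return the bottom-up merge sequence; `window` must be at least 1."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    assignment, active, merges = {}, [], []
    # Words enter the window in order of decreasing frequency.
    for word, _ in Counter(tokens).most_common():
        assignment[word] = word  # new singleton cluster, labelled by the word
        active.append(word)
        if len(active) > window:
            merges.append(merge_best_pair(active, assignment, bigrams))
    # Keep merging the remaining clusters to complete the hierarchy.
    while len(active) > 1:
        merges.append(merge_best_pair(active, assignment, bigrams))
    return merges
```

Calling `windowed_brown(tokens, window=3)` on a tokenized corpus returns a list of `(kept, absorbed)` cluster labels, one per merge; reading the list in order reconstructs the hierarchy. The sketch resolves ties between equally good merges by taking the first pair encountered, which is an arbitrary choice made here for brevity.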

Code & Models

Repositories

manuelciosici/ExchangeAndBrown

Taxonomy

Topics: Advanced Clustering Algorithms Research · Bayesian Methods and Mixture Models · Data Mining Algorithms and Applications