TL;DR
This paper presents a comprehensive experimental study of objective-based hierarchical clustering on large-scale deep embedding datasets from vision and NLP, introducing a new scalable algorithm with improved performance and a theoretical approximation method.
Contribution
It introduces B++&C, a new practical hierarchical clustering algorithm that improves clustering objectives on large datasets, and B2SAT&C, a theoretical algorithm with a better approximation ratio.
Findings
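B++&C improves the normalized Moseley-Wang (MW) and Cohen-Addad et al. (CKMM) objectives by 5% and 20% on average, respectively, compared to classic methods and recent heuristics.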
B2SAT&C achieves a 0.74-approximation for the CKMM objective in polynomial time.
The study covers datasets with up to 4.5 million entries from vision and NLP applications.
Abstract
We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to 4.5 million entries with embedding dimensions up to 2048. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We…
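To make the two objectives named in the abstract concrete, here is a minimal sketch of how they can be evaluated on a toy tree. It assumes the standard unnormalized definitions: MW sums, over all pairs, the similarity times (n minus the number of leaves under the pair's lowest common ancestor), and CKMM sums the distance times the number of leaves under that ancestor. The nested-tuple tree encoding, the helper names, and the toy similarity values are illustrative assumptions, not the paper's code, and the normalization used in the paper is not reproduced here.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a nested-tuple tree (leaves are ints)."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca_sizes(tree, out=None):
    """Map each unordered leaf pair to the leaf count under its LCA."""
    if out is None:
        out = {}
    if not isinstance(tree, tuple):
        return out
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(c) for c in child_leaves)
    # A pair split across two different children has its LCA at this node.
    for a, b in combinations(range(len(child_leaves)), 2):
        for i in child_leaves[a]:
            for j in child_leaves[b]:
                out[frozenset((i, j))] = size
    for child in tree:
        lca_sizes(child, out)
    return out


def mw_objective(tree, sim):
    """Unnormalized Moseley-Wang objective for pairwise similarities."""
    n = len(leaves(tree))
    return sum(sim[pair] * (n - size) for pair, size in lca_sizes(tree).items())


def ckmm_objective(tree, dist):
    """Unnormalized CKMM objective for pairwise distances."""
    return sum(dist[pair] * size for pair, size in lca_sizes(tree).items())


# Toy data (assumed): points 0,1 are similar, points 2,3 are similar.
sim = {frozenset(p): 0.1 for p in combinations(range(4), 2)}
sim[frozenset((0, 1))] = 1.0
sim[frozenset((2, 3))] = 1.0
dist = {pair: 1.0 - s for pair, s in sim.items()}

good = ((0, 1), (2, 3))  # merges similar points first
bad = ((0, 2), (1, 3))   # merges dissimilar points first

print(mw_objective(good, sim), mw_objective(bad, sim))
print(ckmm_objective(good, dist), ckmm_objective(bad, dist))
```

Both objectives are maximized, and on this toy input the tree that merges similar points first scores higher on each. The quadratic pair enumeration is only meant for illustration; it would not scale to the dataset sizes discussed above.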
Taxonomy
Methods: Average Pooling · Kaiming Initialization · Global Average Pooling · Batch Normalization · Residual Block · Residual Connection · Convolution · 1x1 Convolution · Max Pooling
