Objective-Based Hierarchical Clustering of Deep Embedding Vectors

Stanislav Naumov; Grigory Yaroslavtsev; Dmitrii Avdiukhin

arXiv:2012.08466 · cs.LG · June 10, 2022

TL;DR

This paper presents a comprehensive experimental study of objective-based hierarchical clustering on large-scale deep embedding datasets from vision and NLP. It introduces a scalable practical algorithm, B++&C, and a theoretical algorithm, B2SAT&C, with a 0.74-approximation guarantee for the CKMM objective.

Contribution

It introduces a new practical hierarchical clustering algorithm, B++&C, that improves the normalized Moseley-Wang (MW) and Cohen-Addad et al. (CKMM) objectives on large datasets, and a theoretical algorithm, B2SAT&C, with an improved approximation ratio for the CKMM objective.

Findings

01

B++&C improves the normalized MW and CKMM objectives by 5% and 20% on average, respectively, compared to classic methods and recent heuristics.

02

B2SAT&C achieves a 0.74-approximation for the CKMM objective in polynomial time.

03

The study covers datasets with up to 4.5 million entries from vision and NLP applications.

Abstract

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. To address the challenge of scaling up hierarchical clustering to such large datasets, we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We…
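The abstract names the MW and CKMM objectives without defining them. The sketch below evaluates both on a toy binary tree, assuming their standard forms from the literature: MW sums `sim[i][j] * (n - |leaves(lca(i, j))|)` over pairs (similarity input), and CKMM sums `dist[i][j] * |leaves(lca(i, j))|` over pairs (dissimilarity input); both are maximized. The nested-tuple tree encoding, the helper names, and the toy matrices are illustrative assumptions and are not taken from the paper; the normalization used in the paper is not reproduced here.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf indices under a nested-tuple binary tree (leaves are ints)."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def lca_sizes(tree, out=None):
    """Map each unordered leaf pair to the number of leaves under its lowest common ancestor."""
    if out is None:
        out = {}
    if isinstance(tree, int):
        return out
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    size = len(left_leaves) + len(right_leaves)
    # Pairs split at this node have this node as their LCA.
    for i in left_leaves:
        for j in right_leaves:
            out[frozenset((i, j))] = size
    lca_sizes(left, out)
    lca_sizes(right, out)
    return out


def mw_objective(tree, sim):
    """Assumed MW form: sum of sim[i][j] * (n - |leaves(lca(i, j))|); higher is better."""
    n = len(leaves(tree))
    sizes = lca_sizes(tree)
    return sum(sim[i][j] * (n - sizes[frozenset((i, j))])
               for i, j in combinations(range(n), 2))


def ckmm_objective(tree, dist):
    """Assumed CKMM form: sum of dist[i][j] * |leaves(lca(i, j))|; higher is better."""
    n = len(leaves(tree))
    sizes = lca_sizes(tree)
    return sum(dist[i][j] * sizes[frozenset((i, j))]
               for i, j in combinations(range(n), 2))


# Toy data (hypothetical): points 0,1 are close to each other, as are points 2,3.
sim = [[0, 10, 1, 1], [10, 0, 1, 1], [1, 1, 0, 10], [1, 1, 10, 0]]
dist = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]

good_tree = ((0, 1), (2, 3))  # merges similar points first
bad_tree = ((0, 2), (1, 3))   # merges dissimilar points first

print("MW:  ", mw_objective(good_tree, sim), "vs", mw_objective(bad_tree, sim))
print("CKMM:", ckmm_objective(good_tree, dist), "vs", ckmm_objective(bad_tree, dist))
```

Under both assumed objectives the tree that merges the close pairs first scores higher, which is the sense in which an algorithm can "improve" an objective. The pairwise loops are quadratic in the number of points, so this is suitable only for illustration, not for datasets of the scale studied here.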

Taxonomy

Methods: Average Pooling · Kaiming Initialization · Global Average Pooling · Batch Normalization · Residual Block · Residual Connection · Convolution · 1x1 Convolution · Max Pooling