Self-supervised Document Clustering Based on BERT with Data Augment
Haoxiang Shi, Cen Wang

TL;DR
This paper introduces BERT-based self-supervised and few-shot contrastive learning methods for text clustering. The self-supervised method outperforms state-of-the-art unsupervised approaches, and the few-shot method comes close to supervised performance, with data augmentation adding further gains on short texts.
Contribution
It proposes a self-supervised contrastive learning framework, plus a few-shot variant with unsupervised data augmentation, for document clustering.
Findings
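Self-supervised contrastive learning (SCL) outperforms state-of-the-art unsupervised clustering methods on both short and long texts.
Few-shot contrastive learning (FCL) achieves performance close to supervised learning.
FCL with unsupervised data augmentation (UDA) further improves performance on short texts.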
Abstract
Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without a dedicated and complex model design. In this paper, based on bidirectional encoder representations from transformers, we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches for short texts and those for long texts in terms of several clustering evaluation measures. FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.
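The abstract does not spell out the training objective, so the following is only a generic illustration of a contrastive loss over paired views of the same documents, not the authors' formulation. It is a minimal NumPy sketch of an NT-Xent-style loss; the function name `nt_xent_loss`, the `temperature` value, and the use of random vectors in place of BERT embeddings are all assumptions made for illustration.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Generic NT-Xent-style contrastive loss (illustrative, not the paper's).

    z_a[i] and z_b[i] are embeddings of two views of document i (a positive
    pair); every other row in the combined batch acts as a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # shape (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot = cosine
    sim = (z @ z.T) / temperature                      # shape (2n, 2n)
    np.fill_diagonal(sim, -np.inf)                     # a row is never its own pair

    # Row i's positive is the other view of the same document.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

# Stand-in embeddings (assumption: in practice these would come from an encoder).
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
views = docs + 0.01 * rng.normal(size=(8, 16))         # lightly perturbed second view
unrelated = rng.normal(size=(8, 16))                   # views that do not match

loss_matched = nt_xent_loss(docs, views)
loss_unrelated = nt_xent_loss(docs, unrelated)
```

In this sketch the loss is lower when the two views of each document agree (`loss_matched`) than when they are unrelated (`loss_unrelated`), which is the signal a contrastive objective uses to pull matching texts together before clustering.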
Taxonomy
Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
Methods: Contrastive Learning
