Self-supervised Document Clustering Based on BERT with Data Augment

Haoxiang Shi; Cen Wang

arXiv:2011.08523 · cs.CL · September 20, 2021 · 6 cites

TL;DR

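This paper introduces BERT-based self-supervised contrastive learning (SCL) and few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches on both short and long texts, FCL comes close to supervised learning, and FCL with UDA further improves performance on short texts.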

Contribution

It proposes BERT-based self-supervised contrastive learning (SCL) and few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for document clustering.

Findings

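01. Self-supervised contrastive learning (SCL) outperforms state-of-the-art unsupervised clustering approaches on both short and long texts.

02. Few-shot contrastive learning (FCL) achieves performance close to supervised learning.

03. FCL with unsupervised data augmentation (UDA) further improves performance on short texts.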

Abstract

Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without a dedicated and complex model design. In this paper, based on bidirectional encoder representations from transformers (BERT), we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches for both short and long texts in terms of several clustering evaluation measures. FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.
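The abstract does not state which loss the method uses. The following is only a minimal sketch of the general idea behind contrastive learning, assuming an NT-Xent-style objective (a common choice, not confirmed as the authors' formulation): two views of the same document are pulled together, while views of different documents are pushed apart. The names `cosine` and `contrastive_loss`, the `temperature` value, and the toy 2-D vectors are all hypothetical; in practice the embeddings would come from an encoder such as BERT and the views from data augmentation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.5):
    """NT-Xent-style loss (assumed, not taken from the paper).

    views_a[i] and views_b[i] are embeddings of two views of document i.
    Every other embedding in the batch acts as a negative.
    """
    n = len(views_a)
    embeddings = views_a + views_b  # 2n vectors in total
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same document
        # Scaled similarities to every embedding except the anchor itself.
        logits = [
            cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        ]
        pos_logit = cosine(embeddings[i], embeddings[positive]) / temperature
        # Negative log-softmax probability of picking the positive view.
        total += math.log(sum(math.exp(x) for x in logits)) - pos_logit
    return total / (2 * n)


# Toy batch: two documents, two views each (hypothetical 2-D embeddings).
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]     # each view stays close to its own document
misaligned_b = [[0.1, 0.9], [0.9, 0.1]]  # each view resembles the other document

loss_aligned = contrastive_loss(views_a, aligned_b)
loss_misaligned = contrastive_loss(views_a, misaligned_b)
```

In this toy batch the aligned pairing yields the lower loss, which is the signal an encoder would be trained to minimize; the resulting embeddings could then be passed to any standard clustering algorithm.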

Taxonomy

Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis

Methods: Contrastive Learning