Unsupervised Contextualized Document Representation

Ankur Gupta; Vivek Gupta

arXiv:2109.10509·cs.CL·September 23, 2021

TL;DR

This paper introduces SCDV+BERT(ctxd), an unsupervised document representation method that combines contextualized BERT embeddings with soft clustering, outperforming previous models on classification, concept matching, and sentence similarity tasks, especially with limited data.

Contribution

The paper proposes SCDV+BERT(ctxd), a novel unsupervised document embedding technique that integrates contextualized BERT embeddings with soft clustering to better handle polysemy and context.

Findings

01 Outperforms original SCDV and pre-trained BERT on multiple classification datasets.

02 Effective in concept matching and sentence similarity tasks.

03 Excels in low-data and few-shot learning scenarios.

Abstract

Several NLP tasks need an effective representation of text documents. Arora et al., 2017 demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT (Devlin et al., 2019) based word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching…
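The "soft and sparse clustering over pre-computed word vectors" that the abstract attributes to SCDV can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, and the `sparsity` parameter are hypothetical, and a softmax over negative squared distances to fixed centroids stands in for the cluster-membership probabilities that a fitted mixture model would supply. The sparsification rule is also a simplified stand-in. In the contextualized variant, the input vectors would presumably come from BERT rather than a static lookup table; that step is not shown here.

```python
import math


def soft_assign(vec, centroids, temperature=1.0):
    """Soft cluster membership: softmax over negative squared distances.

    Assumption: a stand-in for mixture-model posteriors, not the paper's method.
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    m = min(d2)  # shift by the minimum for numerical stability
    exps = [math.exp(-(d - m) / temperature) for d in d2]
    total = sum(exps)
    return [e / total for e in exps]


def word_cluster_vector(vec, probs, idf):
    """Concatenate one probability-scaled copy of the word vector per cluster."""
    out = []
    for p in probs:
        out.extend(idf * p * x for x in vec)
    return out


def document_vector(tokens, vectors, idf, centroids, sparsity=0.04):
    """Sum word-cluster vectors, L2-normalize, then zero out small entries."""
    dim = len(centroids) * len(centroids[0])
    doc = [0.0] * dim
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        probs = soft_assign(vectors[tok], centroids)
        wcv = word_cluster_vector(vectors[tok], probs, idf.get(tok, 1.0))
        doc = [a + b for a, b in zip(doc, wcv)]
    norm = math.sqrt(sum(x * x for x in doc))
    if norm > 0:
        doc = [x / norm for x in doc]
    # Simplified sparsification: drop entries below a fraction of the largest one.
    threshold = sparsity * max(abs(x) for x in doc)
    return [x if abs(x) >= threshold else 0.0 for x in doc]


# Toy data (hypothetical): 2-dimensional word vectors and two cluster centroids.
vectors = {"bank": [1.0, 0.0], "river": [0.9, 0.1], "loan": [0.0, 1.0]}
idf = {"bank": 1.0, "river": 1.5, "loan": 1.5}
centroids = [[1.0, 0.0], [0.0, 1.0]]

doc = document_vector(["bank", "river"], vectors, idf, centroids)
print(len(doc))  # number of clusters times the word-vector dimension
```

The resulting document vector has one block per cluster, so a word that belongs to several clusters contributes to several blocks in proportion to its membership probabilities.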

Code & Models

Repositories

vgupta123/contextualize_scdv (Official)

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Dense Connections · Softmax · WordPiece · Layer Normalization · Residual Connection · Linear Warmup With Linear Decay · Dropout · Attention Dropout