Topics in Contextualised Attention Embeddings
Mozhgan Talebpour, Alba Garcia Seco de Herrera, Shoaib Jameel

TL;DR
This paper investigates how contextualised word embeddings from models like BERT implicitly form topical clusters, revealing the role of the attention mechanism in this process through probing experiments.
Contribution
It demonstrates that the attention framework in pre-trained language models is crucial for forming word topic clusters without explicit topic modeling.
Findings
Attention mechanisms are key to topical clustering in embeddings.
Clustering on contextual representations emulates latent topic structures.
Probing experiments reveal the implicit formation of word topics.
Abstract
Contextualised word vectors obtained via pre-trained language models encode a variety of knowledge that has already been exploited in applications. Complementary to these language models are probabilistic topic models that learn thematic patterns from the text. Recent work has demonstrated that conducting clustering on the word-level contextual representations from a language model emulates word clusters that are discovered in latent topics of words from Latent Dirichlet Allocation. The important question is how such topical word clusters are automatically formed, through clustering, in the language model when it has not been explicitly designed to model latent topics. To address this question, we design different probe experiments. Using BERT and DistilBERT, we find that the attention framework plays a key role in modelling such word topic clusters. We strongly believe that our work…
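The clustering step the abstract refers to can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the vectors are synthetic stand-ins for contextual token embeddings (no BERT or DistilBERT model is loaded), the token list and the two "topic" centres are hypothetical, and k-means with a deterministic farthest-first initialisation is swapped in as a simple clustering method, since the abstract does not say which algorithm the authors used.

```python
import numpy as np


def farthest_first_init(X, k):
    """Deterministic init: start at X[0], then repeatedly add the farthest point."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    return np.array(centroids)


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical tokens; in a real pipeline each row of X would be a
# contextual embedding taken from a pre-trained language model.
tokens = ["goal", "match", "striker", "loan", "bank", "interest"]
group = np.array([0, 0, 0, 1, 1, 1])  # assumed toy "topics"
centres = np.array([[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
X = centres[group] + 0.1 * rng.standard_normal((len(tokens), 4))

labels, _ = kmeans(X, k=2)

# Collect the tokens that fall in each cluster, analogous to reading
# off the top words of a topic.
clusters = {}
for tok, lab in zip(tokens, labels):
    clusters.setdefault(int(lab), []).append(tok)
print(clusters)
```

With real embeddings, the same token can receive a different vector in each sentence it appears in, so a word may land in more than one cluster; the toy data above does not model that.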
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Adam · Residual Connection · Dense Connections · Layer Normalization · WordPiece · Attention Dropout · Weight Decay · Linear Layer
