A Mutual Information Maximization Perspective of Language Representation Learning
Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama

TL;DR
This paper presents a unified mutual information maximization framework for language representation learning, connecting classical and modern models, and introduces a new self-supervised task inspired by vision methods.
Contribution
It unifies various language embedding models under a mutual information perspective and proposes a novel self-supervised objective for improved sentence representation learning.
Findings
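Classical word embeddings (e.g., Skip-gram) and contextual models (e.g., BERT, XLNet) are shown to maximize a lower bound on the mutual information between different parts of a word sequence.
A new self-supervised objective maximizes the mutual information between a global sentence representation and the n-grams in that sentence.
The objective is inspired by mutual information maximization methods that have been successful in computer vision.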
Abstract
We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of…
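To make the "lower bound on the mutual information" concrete, here is a minimal sketch of one widely used bound of this kind, InfoNCE, applied to sentence and n-gram vectors. The abstract as shown here does not name a specific bound, so the dot-product scoring function, the toy vectors, and the function names `info_nce` and `score_matrix` are assumptions made for illustration, not the authors' implementation.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def score_matrix(sentence_vecs, ngram_vecs):
    """scores[i][j] = dot(sentence_i, ngram_j).

    Assumption: a plain dot product is used as the scoring function.
    """
    return [[dot(s, g) for g in ngram_vecs] for s in sentence_vecs]

def info_nce(scores):
    """InfoNCE estimate from an N x N score matrix.

    Row i holds the scores of sentence i against N candidate n-grams;
    the positive (an n-gram drawn from sentence i) sits on the diagonal
    and the other N - 1 entries act as negatives. The value is
    log N + mean_i log softmax(scores[i])[i], which is at most log N.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += row[i] - log_z
    return math.log(n) + total / n

# Toy vectors standing in for encoder outputs (hypothetical values).
sentence_vecs = [[1.0, 0.0], [0.0, 1.0]]
ngram_vecs = [[0.9, 0.1], [0.2, 0.8]]
estimate = info_nce(score_matrix(sentence_vecs, ngram_vecs))
```

Training an encoder to raise this quantity pushes each sentence representation to score its own n-gram above n-grams from other sentences. Because the estimate cannot exceed log N, the number of negatives limits how much mutual information the bound can certify.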
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Linear Layer · Weight Decay · Residual Connection · Adam · Layer Normalization · Softmax · Attention Is All You Need · Dropout · Multi-Head Attention
