A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong; Cyprien de Masson d'Autume; Wang Ling; Lei Yu; Zihang Dai; Dani Yogatama

arXiv:1910.08350·cs.CL·November 27, 2019·81 cites

TL;DR

This paper presents a unified mutual information maximization framework for language representation learning, connecting classical and modern models, and introduces a new self-supervised task inspired by vision methods.

Contribution

It unifies various language embedding models under a mutual information perspective and proposes a novel self-supervised objective for improved sentence representation learning.

Findings

01. The framework provides theoretical insights into existing models.

02. A new self-supervised task based on mutual information is introduced.

03. The approach facilitates cross-domain transfer of representation learning techniques.

Abstract

We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of…
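
As an illustration of the kind of lower bound the abstract refers to, the sketch below computes an InfoNCE-style contrastive estimate from a matrix of pairwise scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `info_nce_lower_bound`, the use of in-batch negatives, and the convention that `scores[i, j]` holds a learned score between context `a_i` and candidate `b_j` (with matching pairs on the diagonal) are all illustrative choices.

```python
import numpy as np


def info_nce_lower_bound(scores: np.ndarray) -> float:
    """InfoNCE-style estimate of a lower bound on I(A; B).

    Assumption (hypothetical setup): scores[i, j] = f(a_i, b_j) for a batch
    of n pairs, where (a_i, b_i) is the positive pair and the remaining
    entries of row i act as negatives.
    """
    n = scores.shape[0]
    # Numerically stable log-sum-exp over each row (all candidates for a_i).
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # Log-probability of picking the positive candidate among all n candidates.
    log_prob_pos = np.diag(scores) - log_norm
    # Mean log-probability plus log(n); this estimate can never exceed log(n).
    return float(log_prob_pos.mean() + np.log(n))


# Uninformative scores give an estimate of 0; perfectly separating scores
# approach the ceiling log(n).
uninformative = info_nce_lower_bound(np.zeros((4, 4)))
separating = info_nce_lower_bound(50.0 * np.eye(8))
```

Because the estimate is capped at log(n), larger candidate sets are needed before it can reflect higher mutual information; that is a property of this estimator in general, not a claim about the paper's experiments.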

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Linear Layer · Weight Decay · Residual Connection · Adam · Layer Normalization · Softmax · Attention Is All You Need · Dropout · Multi-Head Attention