Mimicking Word Embeddings using Subword RNNs

Yuval Pinter; Robert Guthrie; Jacob Eisenstein

arXiv:1707.06961 · cs.CL · July 24, 2017

TL;DR

This paper introduces MIMICK, a method that generates embeddings for out-of-vocabulary (OOV) words from their spellings using subword RNNs, improving part-of-speech and morphosyntactic tagging without retraining on the original embedding corpus.

Contribution

MIMICK composes OOV word embeddings from spellings. Because it learns at the type level, from existing (word, embedding) pairs, it does not need re-training on the original embedding corpus.

Findings

01. Improves POS and morphosyntactic tagging across 23 languages.

02. Outperforms baseline word-based methods in OOV scenarios.

03. Competitive with supervised character-based models in low-resource settings.

Abstract

Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a…
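The core idea in the abstract, a learned function from spellings to embeddings trained over word types, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the class name `CharRNNMimic`, the tiny dimensions, the toy vectors, the plain tanh recurrence (swapped in for whatever recurrent cell the paper uses), and the squared-distance objective. The sketch covers only the forward pass and the type-level loss; gradient updates are omitted, so the weights stay random.

```python
import math
import random

random.seed(0)

# Illustrative sizes only; none of these values come from the paper.
CHAR_DIM = 4   # character embedding size
HIDDEN = 5     # hidden size per direction
EMB_DIM = 3    # size of the pretrained word embeddings to mimic


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class CharRNNMimic:
    """Hypothetical model: bidirectional tanh RNN over characters plus a linear layer."""

    def __init__(self, alphabet):
        self.char_vecs = {
            c: [random.uniform(-0.5, 0.5) for _ in range(CHAR_DIM)] for c in alphabet
        }
        self.unknown = [0.0] * CHAR_DIM  # used for characters outside the alphabet
        self.params = {
            direction: (rand_matrix(HIDDEN, CHAR_DIM), rand_matrix(HIDDEN, HIDDEN))
            for direction in ("fwd", "bwd")
        }
        self.out = rand_matrix(EMB_DIM, 2 * HIDDEN)

    def _run(self, chars, direction):
        """Run one direction of the RNN and return its final hidden state."""
        w_in, w_rec = self.params[direction]
        h = [0.0] * HIDDEN
        for c in chars:
            x = self.char_vecs.get(c, self.unknown)
            h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
        return h

    def embed(self, word):
        """Map a spelling to a vector in the word-embedding space."""
        forward = self._run(word, "fwd")
        backward = self._run(reversed(word), "bwd")
        return matvec(self.out, forward + backward)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mimic_loss(model, pretrained):
    """Mean squared distance over word types: one term per vocabulary entry."""
    total = sum(squared_distance(model.embed(w), vec) for w, vec in pretrained.items())
    return total / len(pretrained)


# Toy "pretrained" vocabulary; the vectors are made up for the example.
pretrained = {"cat": [0.1, 0.2, 0.3], "cats": [0.1, 0.25, 0.3]}
model = CharRNNMimic("abcdefghijklmnopqrstuvwxyz")

loss = mimic_loss(model, pretrained)   # training would minimize this quantity
oov_vector = model.embed("catlike")    # any spelling yields a vector, seen or not
```

Note that the loss iterates over the vocabulary dictionary, not over running text. This is what "learning at the type level" amounts to in the sketch: the only training data is the existing table of words and their vectors, so the corpus that produced the embeddings is never touched.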
