Dissecting Contextual Word Embeddings: Architecture and Representation
Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, Wen-tau Yih

TL;DR
This paper empirically compares different neural architectures for contextual word embeddings, revealing how they influence task performance and the nature of learned linguistic representations across layers.
Contribution
It provides a comprehensive analysis of how architecture choices affect the quality and properties of contextual embeddings in NLP tasks.
Findings
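All architectures learn contextual representations that outperform static word embeddings on four challenging NLP tasks.
Representations vary with network depth, from purely morphological at the word embedding layer, through local syntax in the lower layers, to semantic in the deeper layers.
There is a tradeoff between model speed and accuracy across architectures.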
Abstract
Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphological based at the word embedding layer through local syntax based in the lower…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sigmoid Activation · Tanh Activation · Long Short-Term Memory
