Probing Pretrained Language Models for Lexical Semantics
Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, Anna Korhonen

TL;DR
This paper systematically investigates how pretrained language models encode lexical semantics across diverse languages and tasks, revealing patterns, best practices, and the distribution of lexical knowledge within the models.
Contribution
It provides a comprehensive empirical analysis of lexical knowledge extraction strategies and compares their effectiveness across languages and tasks, highlighting the distribution of lexical information in models.
Findings
Lower Transformer layers contain more lexical knowledge.
Lexical knowledge is distributed across multiple layers.
Best practices vary across languages and tasks.
Abstract
The success of large pretrained language models (LMs) such as BERT and RoBERTa has sparked interest in probing their representations, in order to unveil what types of knowledge they implicitly capture. While prior research focused on morphosyntactic, semantic, and world knowledge, it remains unclear to which extent LMs also derive lexical type-level knowledge from words in context. In this work, we present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks, addressing the following questions: 1) How do different lexical knowledge extraction strategies (monolingual versus multilingual source LM, out-of-context versus in-context encoding, inclusion of special tokens, and layer-wise averaging) impact performance? How consistent are the observed effects across tasks and languages? 2) Is lexical knowledge stored in few parameters, or…
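To make two of the extraction strategies named in the abstract concrete (layer-wise averaging and the inclusion or exclusion of special tokens), here is a minimal sketch on synthetic arrays. It is not the authors' implementation. The function name `word_vector`, the array layout, and the assumption that a word is encoded in isolation as `[CLS] w_1 ... w_k [SEP]` are all illustrative choices made for this example.

```python
import numpy as np


def word_vector(hidden_states, n_layers, include_special=False):
    """Build one type-level vector for a word encoded in isolation.

    hidden_states: array of shape (L + 1, T, d), assumed to hold the
        embedding-layer output (index 0) followed by L Transformer layers,
        for the sequence [CLS] w_1 ... w_k [SEP] (so T = k + 2).
    n_layers: average the representations from layer 0 up to and
        including layer n_layers.
    include_special: if False, drop the first and last positions
        (assumed here to be [CLS] and [SEP]) before averaging.
    """
    hs = np.asarray(hidden_states, dtype=float)
    if hs.ndim != 3:
        raise ValueError("hidden_states must have shape (layers, tokens, dim)")
    if not 0 <= n_layers < hs.shape[0]:
        raise ValueError("n_layers is out of range for the given hidden states")

    selected = hs[: n_layers + 1]          # layers 0..n_layers
    if not include_special:
        selected = selected[:, 1:-1, :]    # keep only the subword positions
    # Average jointly over the selected layers and the remaining tokens.
    return selected.mean(axis=(0, 1))


# Synthetic stand-in for model output: 3 layers, 4 positions, 2 dimensions.
hs = np.arange(24).reshape(3, 4, 2)
vec = word_vector(hs, n_layers=1)
```

With real model output, the same function could be applied to the per-layer hidden states of a single word. Comparing vectors built with different `n_layers` values is one way to examine where lexical information sits in the network.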
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · WordPiece · Adam · Byte Pair Encoding · Softmax · Multi-Head Attention · Layer Normalization · Dense Connections
