A Cohesive Distillation Architecture for Neural Language Models
Jan Philip Wahle

TL;DR
This paper introduces novel knowledge distillation techniques for neural language models that enhance performance and efficiency without increasing model size, emphasizing the importance of architecture and training methods over mere scale.
Contribution
It proposes two new methods for knowledge distillation and lexical knowledge integration, improving natural language understanding without adding parameters.
Findings
KD with multiple teachers improves convergence
Lexical pre-training boosts NLU task performance
Enhanced semantic understanding benefits real-world applications
Abstract
A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without the necessary hardware infrastructure from participating in the development process. This study investigates methods for Knowledge Distillation (KD) to provide efficient alternatives to large-scale models. In this context, KD means extracting information about language encoded in a Neural Network and in Lexical Knowledge Databases. We developed two methods to test our hypothesis that efficient architectures can gain knowledge from LMs and extract valuable information from lexical sources. First, we present a technique to learn a confident probability distribution for Masked Language Modeling by prediction weighting of multiple teacher networks. Second, we propose a method for Word Sense Disambiguation (WSD) and lexical KD that is general enough to be…
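The first method, prediction weighting of multiple teacher networks, can be illustrated with a small sketch. The abstract does not state how the teachers are weighted, so the scheme below is an assumption made for illustration only: each teacher's weight is proportional to the highest probability it assigns at the masked position. The function names (`softmax`, `combine_teachers`, `distillation_loss`) and the toy logits are hypothetical and not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combine_teachers(teacher_logits, temperature=1.0):
    """Mix teacher distributions for one masked position.

    ASSUMPTION: a teacher's weight is its maximum predicted probability,
    normalized over all teachers. The paper may use a different rule.
    """
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    confidences = [max(d) for d in dists]
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab_size = len(dists[0])
    target = [sum(w * d[i] for w, d in zip(weights, dists))
              for i in range(vocab_size)]
    return target, weights

def distillation_loss(student_logits, target, temperature=1.0):
    """Cross-entropy of the student prediction against the mixed target."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0)

# Toy example: a 3-token vocabulary and two teachers (hypothetical values).
teachers = [[4.0, 1.0, 0.0],   # confident teacher
            [1.0, 1.2, 0.9]]   # uncertain teacher
target, weights = combine_teachers(teachers)
loss_close = distillation_loss([4.0, 1.0, 0.0], target)
loss_far = distillation_loss([0.0, 1.0, 4.0], target)
```

Under this assumed rule the confident teacher dominates the mixed target, and a student whose prediction agrees with that target receives a lower loss than one that disagrees.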
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Test · Knowledge Distillation
