A Cohesive Distillation Architecture for Neural Language Models

Jan Philip Wahle

arXiv:2301.08130·cs.CL·January 31, 2023

TL;DR

This paper introduces novel knowledge distillation techniques for neural language models that enhance performance and efficiency without increasing model size, emphasizing the importance of architecture and training methods over mere scale.

Contribution

The paper proposes two new methods, one for knowledge distillation and one for lexical knowledge integration, that improve natural language understanding without adding parameters.

Findings

1. KD with multiple teachers improves convergence
2. Lexical pre-training boosts NLU task performance
3. Enhanced semantic understanding benefits real-world applications

Abstract

A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without the necessary hardware infrastructure from participating in the development process. This study investigates methods for Knowledge Distillation (KD) to provide efficient alternatives to large-scale models. In this context, KD means extracting information about language encoded in Neural Networks and Lexical Knowledge Databases. We developed two methods to test our hypothesis that efficient architectures can gain knowledge from LMs and extract valuable information from lexical sources. First, we present a technique to learn confident probability distributions for Masked Language Modeling by prediction weighting of multiple teacher networks. Second, we propose a method for Word Sense Disambiguation (WSD) and lexical KD that is general enough to be…
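The first method in the abstract, prediction weighting of multiple teacher networks, can be illustrated with a minimal sketch. The abstract does not give the weighting rule, so everything specific here is an assumption made for illustration: each teacher is weighted by one minus its normalized entropy, the function names (combine_teachers, distillation_loss) are hypothetical, and the loss is a plain soft-target cross-entropy for a single masked position. It is not the paper's implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def combine_teachers(teacher_logits, temperature=1.0):
    """Blend teacher distributions for one masked position.

    Assumption (not from the paper): a teacher's confidence is
    1 - entropy / log(vocab_size), so a peaked prediction counts more
    than a near-uniform one. Requires a vocabulary of at least 2 tokens.
    """
    dists = [softmax(logits, temperature) for logits in teacher_logits]
    vocab_size = len(dists[0])
    max_entropy = math.log(vocab_size)
    confidences = [max(0.0, 1.0 - entropy(d) / max_entropy) for d in dists]
    total = sum(confidences)
    if total <= 0.0:
        # Every teacher is uniform: fall back to equal weights.
        weights = [1.0 / len(dists)] * len(dists)
    else:
        weights = [c / total for c in confidences]
    target = [
        sum(w * d[i] for w, d in zip(weights, dists))
        for i in range(vocab_size)
    ]
    return target, weights


def distillation_loss(student_logits, target, temperature=1.0):
    """Cross-entropy between the blended teacher target and the student."""
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0.0)


if __name__ == "__main__":
    # Toy 4-token vocabulary: one confident teacher, one undecided teacher.
    teachers = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    target, weights = combine_teachers(teachers)
    print("teacher weights:", weights)
    print("loss:", distillation_loss([2.0, 0.5, 0.0, 0.0], target))
```

In this toy setup the confident teacher dominates the blended target, which is one plausible reading of "confident probability distributions". Other weighting rules, such as weighting by each teacher's probability on the gold token, would fit the same interface.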

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Test · Knowledge Distillation