KL Regularized Normalization Framework for Low Resource Tasks
Neeraj Kumar, Ankur Narang, Brejesh Lall

TL;DR
This paper introduces KL Regularized Normalization (KL-Norm), a normalization technique for low-resource NLP and speech tasks that improves generalization and reduces overfitting with minimal additional overhead.
Contribution
The paper proposes KL-Norm, a new normalization method that captures expressiveness better and improves low-resource task performance over existing normalization techniques.
Findings
KL-Norm outperforms other normalization methods in low-resource NLP and speech tasks.
It reduces overfitting and improves out-of-domain generalization.
KL-Norm adds negligible model parameters and memory overhead.
Abstract
Large pre-trained models, such as BERT, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. However, it is difficult to obtain a large quantity of supervised data due to the limited availability of resources and time. In light of this, a significant amount of research has been conducted on adapting large pre-trained models to diverse downstream tasks via fine-tuning, linear probing, or prompt tuning in low-resource settings. Normalization techniques are essential for accelerating training and improving the generalization of deep neural networks and have been successfully used in a wide variety of applications. Many normalization techniques have been proposed, but the success of normalization in low-resource downstream NLP and speech tasks is limited. One of the reasons is the inability to…
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Adam · Linear Warmup With Cosine Annealing · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections
