Query-Key Normalization for Transformers
Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, Yuxuan Chen

TL;DR
This paper introduces QKNorm, a normalization method for Transformer attention that makes the softmax less prone to saturation and improves low-resource translation by an average of 0.928 BLEU across five language pairs.
Contribution
QKNorm modifies the attention mechanism with $\ell_2$ normalization and learnable scaling, enhancing translation quality in low-resource settings.
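In symbols, the change can be sketched as follows. The notation ($\hat{Q}$, $\hat{K}$, $g$, $d$) is assumed here for illustration and is not quoted from the paper. Standard scaled dot-product attention divides the query-key products by $\sqrt{d}$; QKNorm instead normalizes each query and key vector to unit length and multiplies by a learnable scalar $g$:

$$\text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V \quad\longrightarrow\quad \text{softmax}\left(g\,\hat{Q}\hat{K}^\top\right)V, \qquad \hat{q}_i = \frac{q_i}{\lVert q_i\rVert_2},\quad \hat{k}_j = \frac{k_j}{\lVert k_j\rVert_2}$$

Each entry of $\hat{Q}\hat{K}^\top$ is then a cosine similarity in $[-1, 1]$, so the softmax logits are bounded by $\pm g$ regardless of how large the raw query and key vectors grow.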
Findings
Average BLEU improvement of 0.928 over state-of-the-art bilingual benchmarks on 5 low-resource translation pairs (TED Talks corpus and IWSLT'15)
Effective normalization technique for low-resource translation tasks
Improves attention stability without reducing model expressivity
Abstract
Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply $\ell_2$ normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT'15.
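The attention change described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor layout `(heads, seq, head_dim)`, the `eps` guard, and the fixed value of `g` are all assumptions made for illustration. In a trained model `g` would be a learnable parameter rather than a constant.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-6):
    # Scale each vector along `axis` to unit l2 norm (eps guards against zero vectors).
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def qknorm_attention(q, k, v, g):
    # q, k, v: arrays of shape (heads, seq, head_dim); this layout is an assumption.
    # g: scalar scale; learnable in the paper's setting, fixed here for illustration.
    q_hat = l2_normalize(q)  # normalize along the head dimension
    k_hat = l2_normalize(k)
    # Entries of q_hat @ k_hat^T are cosine similarities, so logits lie in [-g, g].
    scores = g * (q_hat @ np.swapaxes(k_hat, -1, -2))
    weights = softmax(scores)
    return weights @ v, weights

# Toy inputs with deliberately large magnitudes, which would push a
# plain dot-product softmax toward saturation.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8)) * 50.0
k = rng.normal(size=(2, 4, 8)) * 50.0
v = rng.normal(size=(2, 4, 8))
out, weights = qknorm_attention(q, k, v, g=5.0)
```

Because the queries and keys are normalized before the product, rescaling `q` or `k` by any positive constant leaves the attention weights unchanged; only `g` controls how peaked the softmax can become.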
Code & Models
- 🤗 TehVenom/MPT-7b-Apache-2.0 · model · ♡ 2
- 🤗 gl198976/mpt-7b · model · 173 dl · ♡ 2
- 🤗 gl198976/mpt-7b-instruct · model · 393 dl · ♡ 1
- 🤗 gouravsinha/finance-NER · model · 2 dl · ♡ 6
- 🤗 P1ayer-1/mpt-7b-instruct-base · model · 140 dl · ♡ 2
- 🤗 TheBloke/MPT-7B-GGML · model · 6 dl · ♡ 21
- 🤗 TheBloke/MPT-7B-Instruct-GGML · model · 129 dl · ♡ 30
- 🤗 Pratye/mpt-7b-chat · model · 27 dl
- 🤗 gretelai/mpt-7b · model · 188 dl · ♡ 5
- 🤗 Birchlabs/mosaicml-mpt-7b-chat-qlora · model · 9 dl · ♡ 22
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies
Methods: Softmax
