Dependency Parsing is More Parameter-Efficient with Normalization
Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barr\'on-Cede\~no

TL;DR
This paper demonstrates that applying normalization to biaffine scoring in dependency parsing models reduces overparameterization, improves efficiency, and achieves state-of-the-art results across multiple languages and tasks.
Contribution
The paper provides theoretical and empirical evidence that score normalization in biaffine parsing models enhances parameter efficiency and performance.
Findings
Normalization reduces model overparameterization.
Normalized models achieve state-of-the-art accuracy.
Normalization improves training efficiency across tasks.
Abstract
Dependency parsing is the task of inferring natural language structure, often approached by modeling word interactions via attention through biaffine scoring. This mechanism works like self-attention in Transformers, where scores are calculated for every pair of words in a sentence. However, unlike Transformer attention, biaffine scoring does not use normalization prior to taking the softmax of the scores. In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on semantic and syntactic dependency parsing in multiple…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsNatural Language Processing Techniques
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
