How BPE Affects Memorization in Transformers
Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes

TL;DR
This paper investigates how the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) affects the ability and tendency of Transformer models to memorize and reproduce training data.
Contribution
It shows that larger BPE vocabularies increase memorization and vulnerability to membership inference attacks, even when the number of learned parameters is controlled, which makes vocabulary size a relevant choice in model training and deployment.
Findings
Larger BPE vocabularies lead to more memorization in Transformers.
Models with bigger vocabularies are more susceptible to membership inference attacks.
Increasing the BPE vocabulary size shortens tokenized sequences, which in turn affects memorization behavior.
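The link between vocabulary size and sequence length can be illustrated with a toy BPE learner. The sketch below is a simplified illustration, not the paper's setup: the corpus, the function names (learn_bpe, apply_merge, encode, total_tokens) and the tie-breaking rule are all assumptions made for this example, and it omits details such as end-of-word markers. Each learned merge adds one symbol to the vocabulary, and the total token count of the corpus can only stay the same or shrink as merges are added.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself (an arbitrary, deterministic choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the merge rules in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def total_tokens(words, merges):
    """Total number of tokens needed to encode all words."""
    return sum(len(encode(w, merges)) for w in words)


# Hypothetical toy corpus, chosen only for illustration.
corpus = "low low low lower lowest newer newer wider".split()
all_merges = learn_bpe(corpus, 10)

for k in range(len(all_merges) + 1):
    # k merges correspond to a vocabulary of (base characters + k) symbols.
    print(k, total_tokens(corpus, all_merges[:k]))
```

Reading the printed table top to bottom shows the token count falling as the number of merges (and hence the vocabulary size) grows, which is the sequence-length effect the finding refers to.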
Abstract
Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to focus on the equally important question of generalization. In this work, we demonstrate that the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) greatly affects both ability and tendency of standard Transformer models to memorize training data, even when we control for the number of learned parameters. We find that with a large subword vocabulary size, Transformer models fit random mappings more easily and are more vulnerable to membership inference attacks. Similarly, given a…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing · Softmax · Residual Connection
