Memorization and Generalization in Neural Code Intelligence Models
Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, and Vincent J. Hellendoorn

TL;DR
This paper investigates the extent to which neural code models memorize noisy training data rather than generalize from it. It finds that large models can memorize anything, which can give a misleading picture of their effectiveness on software engineering tasks.
Contribution
It provides the first quantification of memorization effects in neural code models, highlighting risks of overfitting noisy data in code intelligence systems.
Findings
Models memorize noisy data, risking false generalization.
Memorization occurs even in state-of-the-art models.
Large models can memorize anything, including noise.
Abstract
Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset…
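The noise-injection step mentioned at the end of the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the noise is applied to output labels, that samples are `(code, label)` pairs, and that a fixed fraction of samples is corrupted. The function name `add_label_noise` and the parameters `noise_rate` and `seed` are hypothetical.

```python
import random


def add_label_noise(samples, noise_rate, label_set, seed=0):
    """Return a copy of `samples` with a fraction of labels randomly replaced.

    samples    : list of (code, label) pairs (assumed data layout)
    noise_rate : fraction of samples to corrupt, between 0.0 and 1.0
    label_set  : all possible labels; must contain at least two
    """
    rng = random.Random(seed)
    labels = sorted(label_set)
    n_noisy = int(round(noise_rate * len(samples)))
    # Pick which samples to corrupt, without replacement.
    noisy_idx = set(rng.sample(range(len(samples)), n_noisy))

    noisy_samples = []
    for i, (code, label) in enumerate(samples):
        if i in noisy_idx:
            # Replace with a different label so the sample is truly wrong.
            label = rng.choice([l for l in labels if l != label])
        noisy_samples.append((code, label))
    return noisy_samples, noisy_idx
```

Training one model on the original data and another on the noisy copy, then comparing training accuracy on the indices in `noisy_idx`, gives a rough signal of how much the model memorizes: a model can fit randomly relabeled samples only by memorizing them.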
Taxonomy
Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Software System Performance and Reliability
Methods: CodeBERT
