Memorization and Generalization in Neural Code Intelligence Models

Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, and Vincent J. Hellendoorn

arXiv:2106.08704·cs.LG·September 15, 2022

TL;DR

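This paper investigates how far neural code intelligence models memorize noisy training data rather than generalize from it. It finds that large models can memorize arbitrary samples, including noise, so their measured performance may give a misleading picture of their effectiveness on software engineering tasks.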

Contribution

It provides the first quantification of memorization effects in neural code models, highlighting risks of overfitting noisy data in code intelligence systems.

Findings

01. Models memorize noisy data, risking false generalization.

02. Memorization occurs even in state-of-the-art models.

03. Large models can memorize anything, including noise.

Abstract

Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset…
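The noise-injection step mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so the details below are assumptions rather than the authors' procedure: the sketch assumes label noise (a chosen fraction of training labels replaced with a different, randomly drawn label), and the names `add_label_noise`, `memorized_fraction`, `noise_rate`, and `label_set` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def add_label_noise(
    labels: Sequence[str],
    noise_rate: float,
    label_set: Sequence[str],
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Replace a fraction of labels with a different random label.

    Assumption: noise is applied to labels only; the source does not
    specify the noise type in the visible part of the abstract.
    Returns the noisy label list and the sorted indices that were changed.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    if len(set(label_set)) < 2:
        raise ValueError("label_set needs at least two distinct labels")

    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    n_noisy = round(noise_rate * len(labels))
    noisy_idx = sorted(rng.sample(range(len(labels)), n_noisy))

    noisy = list(labels)
    for i in noisy_idx:
        # Draw only from labels that differ from the original one,
        # so every selected sample is guaranteed to be mislabeled.
        candidates = [c for c in label_set if c != labels[i]]
        noisy[i] = rng.choice(candidates)
    return noisy, noisy_idx


def memorized_fraction(
    predictions: Sequence[str],
    noisy_labels: Sequence[str],
    noisy_idx: Sequence[int],
) -> float:
    """Share of corrupted training samples on which the model reproduces
    the corrupted label (a hypothetical proxy for memorization)."""
    if not noisy_idx:
        return 0.0
    hits = sum(predictions[i] == noisy_labels[i] for i in noisy_idx)
    return hits / len(noisy_idx)
```

A model that fits the corrupted labels on its training set cannot have learned them from any generalizable pattern, so a high `memorized_fraction` on the noisy subset is one plausible signal of memorization under these assumptions.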

Taxonomy

Topics: Software Engineering Research · Parallel Computing and Optimization Techniques · Software System Performance and Reliability

Methods: CodeBERT