Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

TL;DR
This paper introduces an open-vocabulary neural language model for source code that scales to large corpora and outperforms previous models, addressing the challenge of rapidly expanding vocabulary in code.
Contribution
It presents a scalable open-vocabulary model for source code and demonstrates its superior performance on large, diverse code datasets.
Findings
The model scales to a corpus of 13,362 projects, 100 times larger than in previous work.
It outperforms state-of-the-art models on Java, C, and Python datasets.
The resulting models are the largest neural language models for source code reported to date.
Abstract
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the…
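The vocabulary problem described in the abstract can be made concrete with a small example. The excerpt above does not say how the open vocabulary is built, so the following is only a minimal sketch of one widely used approach, subword segmentation with byte-pair encoding (BPE). The function names (`learn_bpe`, `merge_pair`, `segment`), the end-of-token marker `</t>`, and the toy token list are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(tokens, num_merges):
    """Learn up to `num_merges` merge rules from a list of code tokens."""
    # Start from characters; "</t>" (a hypothetical marker) ends each token.
    words = Counter(tuple(tok) + ("</t>",) for tok in tokens)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by token frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every token in the working vocabulary.
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def segment(token, merges):
    """Split a token, seen or unseen, into subword units using the learned merges."""
    symbols = tuple(token) + ("</t>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy identifier list standing in for a tokenized code corpus.
    corpus = ["getName", "getName", "setName", "getValue"]
    merges = learn_bpe(corpus, num_merges=10)
    # An identifier never seen in training is still representable as subwords.
    print(segment("getUserName", merges))
```

Because segmentation falls back to single characters when no merge applies, a previously unseen identifier never needs an out-of-vocabulary symbol, and the vocabulary size is bounded by the initial character set plus the number of merges.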
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
