Maybe Deep Neural Networks are the Best Choice for Modeling Source Code
Rafael-Michael Karampatsis, Charles Sutton

TL;DR
This paper introduces a novel open-vocabulary neural language model for source code that segments code into subword units, outperforming existing models and adapting dynamically to new projects across multiple programming languages.
Contribution
The authors present the first open-vocabulary neural language model for code using subword segmentation, achieving state-of-the-art results and demonstrating scalability to large, multilingual code corpora.
Findings
Outperforms previous state-of-the-art models in code prediction tasks.
Effectively adapts to new projects, improving performance.
Handles over a billion tokens across three programming languages.
Abstract
Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Topic Modeling · Natural Language Processing Techniques
