A Survey of Machine Learning for Big Code and Naturalness
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton

TL;DR
This survey reviews recent advances in machine learning models for source code, highlighting their design principles, applications, and challenges in leveraging code's naturalness and patterns.
Contribution
It provides a taxonomy of probabilistic models of source code organized by their underlying design principles, contrasts programming languages with natural languages, and reviews how the models have been applied.
Findings
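Learnable probabilistic models of source code exploit code's abundance of patterns, the property often called naturalness.
The taxonomy groups models by their underlying design principles.
The survey identifies cross-cutting and application-specific challenges and opportunities in applying machine learning to code.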
Abstract
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
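To make "learnable probabilistic models of source code" concrete, the following is a minimal sketch of one simple model family, a bigram language model over code tokens. It is an illustration and is not taken from the survey. The tokenizer, the add-one smoothing, and the function names (`tokenize`, `train_bigram`, `cross_entropy`) are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    # Crude lexer (an assumption): identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def train_bigram(snippets):
    # Count how often each token follows each preceding token.
    counts = defaultdict(Counter)
    vocab = set()
    for snippet in snippets:
        tokens = ["<s>"] + tokenize(snippet) + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def cross_entropy(code, counts, vocab):
    # Average bits per token under the add-one-smoothed bigram model.
    # Lower values mean the snippet looks more like the training code.
    tokens = ["<s>"] + tokenize(code) + ["</s>"]
    v = len(vocab) + 1  # one extra slot for tokens never seen in training
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        ctx = counts.get(prev, Counter())
        p = (ctx[cur] + 1) / (sum(ctx.values()) + v)
        total -= math.log2(p)
    return total / (len(tokens) - 1)


# Hypothetical toy corpus; real models are trained on large code bases.
corpus = [
    "for i in range(n): total += i",
    "for j in range(m): count += j",
    "for k in range(n): total += k",
]
counts, vocab = train_bigram(corpus)
typical = cross_entropy("for i in range(n): count += i", counts, vocab)
scrambled = cross_entropy("i += ) for : n range ( in total", counts, vocab)
print(typical < scrambled)
```

Under these assumptions, the snippet that follows the corpus's token patterns receives a lower cross-entropy than the scrambled one. This is the sense in which such a model exploits the repetitiveness of code.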
