Modeling Vocabulary for Big Code Machine Learning

Hlib Babii; Andrea Janes; Romain Robbes

arXiv:1904.01873 · cs.CL · April 4, 2019 · 25 cites

Open Access

TL;DR

This paper examines key decisions in modeling source code vocabulary for machine learning, demonstrating how these choices impact model training and performance on a large-scale corpus of open-source projects.

Contribution

It identifies critical vocabulary modeling decisions and evaluates their effects, enabling efficient training of neural language models on extensive code datasets.

Findings

01. Some vocabulary modeling choices determine whether a model can be trained at all.

02. A subset of choices allows accurate Neural Language Models to be trained quickly on a corpus of 10,106 projects.

03. Other choices significantly affect performance, particularly for Neural Language Models.

Abstract

When building machine learning models that operate on source code, several decisions have to be made to model source-code vocabulary. These decisions can have a large impact: some can make it impossible to train models at all, while others significantly affect performance, particularly for Neural Language Models. Yet these decisions are often not fully described. This paper lists important modeling choices for source code vocabulary and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. We show that a subset of decisions have decisive characteristics, allowing accurate Neural Language Models to be trained quickly on a large corpus of 10,106 projects.
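
The abstract does not enumerate the individual modeling choices. As a hypothetical illustration of how one such choice can change the resulting vocabulary, the sketch below compares a vocabulary of whole tokens with one where identifiers are split on underscores and camelCase boundaries and then lowercased. The tokenizer, the splitting rule, the function names, and the toy snippets are all assumptions made for this example, not the authors' pipeline or data.

```python
import re
from collections import Counter


def tokenize(code):
    """Very rough lexer (an assumption): identifiers, integers, or single symbols."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S", code)


def split_identifier(token):
    """Split a token on underscores and camelCase boundaries.

    Tokens that contain no letters or digits (e.g. punctuation) are returned unchanged.
    """
    parts = []
    for chunk in token.split("_"):
        if chunk:
            # Acronym run, capitalized or lowercase word, or digit run.
            parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts or [token]


def vocabulary(snippets, split=False, lowercase=False):
    """Count distinct vocabulary entries under the chosen modeling options."""
    counts = Counter()
    for code in snippets:
        for token in tokenize(code):
            pieces = split_identifier(token) if split else [token]
            if lowercase:
                pieces = [p.lower() for p in pieces]
            counts.update(pieces)
    return counts


# Toy corpus, invented for this example.
snippets = [
    "getUserName(userId)",
    "setUserName(userName)",
    "get_user_id(user_name)",
    "set_user_id(userId)",
]

full_vocab = vocabulary(snippets)
split_vocab = vocabulary(snippets, split=True, lowercase=True)

print("whole tokens:", len(full_vocab), sorted(full_vocab))
print("split + lowercase:", len(split_vocab), sorted(split_vocab))
```

On this toy input the split vocabulary is smaller because compound identifiers are rebuilt from a few shared subwords. Whether and how much such a choice helps on a real corpus is an empirical question of the kind the paper studies.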

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques