Multilingual training for Software Engineering

Toufique Ahmed; Premkumar Devanbu

arXiv:2112.02043·cs.SE·February 4, 2022

TL;DR

This paper demonstrates that multilingual training data, leveraging similarities across programming languages, can enhance machine learning models for software engineering tasks like code summarization, retrieval, and naming.

Contribution

It introduces a novel approach of using cross-language similarities, especially in identifiers, to augment training data and improve model performance in software engineering tasks.

Findings

01. Multilingual code data shows high similarity, especially in identifiers.

02. Using multilingual data improves performance across tasks.

03. Identifier patterns are crucial for training data effectiveness.
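
To make the identifier-similarity finding concrete, here is a minimal sketch of one way to compare identifier usage across two languages. It is an illustration only, not the authors' measurement procedure: the Jaccard overlap metric, the regex tokenizer, the abbreviated keyword lists, and the two snippets are all assumptions chosen for this example.

```python
import re

# Abbreviated, hypothetical keyword/builtin lists for illustration only;
# a real analysis would use a proper lexer for each language.
KEYWORDS = {
    "java": {"public", "static", "int", "return", "for", "if", "else", "new", "void", "class"},
    "python": {"def", "return", "for", "in", "if", "else", "range", "len", "class"},
}

# Matches identifier-like tokens (letters, digits, underscores; no leading digit).
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(code, lang):
    """Return the set of identifier-like tokens in `code`, minus language keywords."""
    return set(IDENT_RE.findall(code)) - KEYWORDS[lang]


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two hand-written snippets that solve the same problem in different languages.
JAVA_SNIPPET = """
public static int sumValues(int[] values) {
    int total = 0;
    for (int i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
"""

PYTHON_SNIPPET = """
def sumValues(values):
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total
"""

java_ids = identifiers(JAVA_SNIPPET, "java")
python_ids = identifiers(PYTHON_SNIPPET, "python")

print(sorted(java_ids & python_ids))             # identifiers shared by both snippets
print(round(jaccard(java_ids, python_ids), 2))   # overlap score for this toy pair
```

For this toy pair, the shared identifiers are `i`, `sumValues`, `total`, and `values`, and the only unshared one is Java's `length`, giving an overlap of 0.8. Real corpora would need per-language lexers and far larger samples before drawing any conclusion.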

Abstract

Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years thanks to better models and training methods. More, and more diverse, clean, labeled data is better for training, but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which…
