Multilingual training for Software Engineering
Toufique Ahmed, Premkumar Devanbu

TL;DR
This paper demonstrates that multilingual training data, leveraging similarities across programming languages, can enhance machine learning models for software engineering tasks like code summarization, retrieval, and naming.
Contribution
The paper proposes using cross-language similarities, especially in identifiers, to augment training data and improve model performance on software engineering tasks.
Findings
Code written in different programming languages shows high similarity, especially in identifiers.
Training on multilingual data improves performance across tasks such as code summarization, retrieval, and naming.
Identifier patterns are crucial for training data effectiveness.
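To make the identifier-similarity finding concrete, here is a minimal sketch of one way such overlap could be measured: extract identifiers from two snippets, split them into subtokens, and compute Jaccard similarity. This is an illustration only, not the authors' measurement procedure. The keyword lists, the function names (`subtokens`, `identifier_subtokens`, `jaccard`), and the two snippets are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny keyword lists (an assumption for this sketch);
# a real study would use a proper lexer for each language.
KEYWORDS = {
    "java": {"public", "static", "int", "return", "for", "if", "else", "new", "void", "class"},
    "python": {"def", "return", "for", "in", "if", "else", "class", "import"},
}

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def subtokens(name):
    """Split a camelCase or snake_case name into a set of lowercase subtokens."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return {p.lower() for p in parts if p}


def identifier_subtokens(code, language):
    """Collect subtokens of all non-keyword identifiers found in a snippet."""
    names = set(IDENTIFIER.findall(code)) - KEYWORDS[language]
    result = set()
    for name in names:
        result |= subtokens(name)
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up snippets implementing the same function in different languages.
java_code = "public static int findMaxValue(int[] numbers) { int maxValue = numbers[0]; return maxValue; }"
python_code = "def find_max_value(numbers):\n    max_value = numbers[0]\n    return max_value"

java_ids = identifier_subtokens(java_code, "java")
python_ids = identifier_subtokens(python_code, "python")
similarity = jaccard(java_ids, python_ids)
```

Splitting names into subtokens is a design choice that lets `findMaxValue` and `find_max_value` count as the same vocabulary despite different naming conventions; comparing whole identifiers would understate the overlap between these two snippets.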
Abstract
Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which…
