Data Augmentation for Code Translation with Comparable Corpora and Multiple References
Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose

TL;DR
This paper introduces two data augmentation methods for code translation that leverage comparable corpora and multiple references, significantly improving translation accuracy across Java, Python, and C++.
Contribution
It proposes novel techniques for creating comparable corpora and multiple references, enhancing training data for code translation models.
Findings
Improved CodeT5 translation accuracy by 7.5% CA@1
Generated diverse target translations filtered by unit tests
Enhanced training data with comparable corpora from natural language documentation
Abstract
One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Software Engineering Research
MethodsGated Linear Unit · Refunds@Expedia|||How do I get a full refund from Expedia? · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Layer Normalization · Attention Dropout · SentencePiece
