Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Yiqing Xie; Atharva Naik; Daniel Fried; Carolyn Rose

arXiv:2311.00317·cs.CL·October 7, 2024·1 cites

TL;DR

This paper introduces two data augmentation methods for code translation that leverage comparable corpora and multiple references, significantly improving translation accuracy across Java, Python, and C++.

Contribution

The paper proposes techniques for building comparable corpora and for generating multiple reference translations, expanding the training data available to code translation models.

Findings

01

Improved CodeT5 translation accuracy by an average of 7.5% in computational accuracy (CA@1)

02

Generated diverse target translations filtered by unit tests

03

Enhanced training data with comparable corpora from natural language documentation

Abstract

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java,…
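The unit-test filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `passes_tests` and `filter_candidates`, the `(args, expected)` test format, and the use of Python as the target language are all assumptions made for illustration, and the candidate translations are hard-coded rather than sampled from a model.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical test format: (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def passes_tests(source: str, entry_point: str, tests: Sequence[TestCase]) -> bool:
    """Return True if `source` defines `entry_point` and it passes every test."""
    namespace: Dict[str, Any] = {}
    try:
        # Illustration only: exec() is not sandboxed. Real pipelines should
        # run untrusted generated code in an isolated process with a timeout.
        exec(source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_candidates(
    candidates: Sequence[str], entry_point: str, tests: Sequence[TestCase]
) -> List[str]:
    """Keep distinct candidate translations that pass all unit tests."""
    kept: List[str] = []
    seen = set()
    for source in candidates:
        key = source.strip()
        if key in seen:  # skip exact duplicates so references stay diverse
            continue
        seen.add(key)
        if passes_tests(source, entry_point, tests):
            kept.append(source)
    return kept


# Stand-ins for translations sampled from a model (assumed, not real outputs).
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return sum([a, b])\n",
    "def add(a, b):\n    return a - b\n",   # wrong behavior
    "def add(a, b)\n    return a + b\n",    # syntax error
    "def add(a, b):\n    return a + b\n",   # exact duplicate of the first
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

references = filter_candidates(candidates, "add", tests)
print(len(references))
```

The surviving candidates differ in surface form but agree on the tested behavior, which is the property that makes them usable as additional references for the same source program.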

Code & Models

Repositories

veronicium/cmtrans (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Layer Normalization · Attention Dropout · SentencePiece