Using Document Similarity Methods to create Parallel Datasets for Code Translation
Mayank Agarwal, Kartik Talamadupula, Fernando Martinez, Stephanie Houde, Michael Muller, John Richards, Steven I Ross, Justin D. Weisz

TL;DR
This paper introduces a method to generate noisy parallel datasets for code translation using document similarity, enabling supervised learning for languages lacking parallel corpora, and demonstrates its effectiveness across multiple programming languages.
Contribution
The paper proposes a novel approach to create parallel datasets from monolingual code corpora using document similarity, facilitating supervised code translation without needing curated datasets.
Findings
Models trained on the noisy parallel datasets perform comparably to models trained on ground-truth parallel data.
The method enables creation of parallel datasets for less-studied programming languages.
The approach expands the applicability of supervised code translation techniques.
Abstract
Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis by applying natural language processing techniques towards automating the code translation task. However, due to the paucity of parallel data in this domain, supervised techniques have only been applied to a limited set of popular programming languages. To bypass this limitation, unsupervised neural machine translation techniques have been proposed to learn code translation using only monolingual corpora. In this work, we propose to use document similarity methods to create noisy parallel datasets of code, thus enabling supervised techniques to be applied for automated code translation without having to rely on the availability or expensive curation of…
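As an illustration of the idea in the abstract, the sketch below pairs each document in one monolingual code corpus with its most similar document in another. The abstract as shown here does not name the similarity measures the authors use, so TF-IDF with cosine similarity is an assumption chosen for simplicity. The function names (`tokenize`, `tfidf_vectors`, `pair_documents`), the `threshold` parameter, and the toy Java and Python snippets are all hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Split code into identifiers, integer literals, and single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def tfidf_vectors(docs):
    # Build sparse TF-IDF vectors (dicts) over a shared vocabulary.
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed inverse document frequency; always positive.
    idf = {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_documents(source_docs, target_docs, threshold=0.0):
    # Assumed pairing rule: each source document is matched to its
    # highest-scoring target document, kept only if the score reaches
    # `threshold`. The resulting pairs are noisy, not verified translations.
    if not source_docs or not target_docs:
        return []
    vectors = tfidf_vectors(source_docs + target_docs)
    src, tgt = vectors[:len(source_docs)], vectors[len(source_docs):]
    pairs = []
    for i, sv in enumerate(src):
        scores = [cosine(sv, tv) for tv in tgt]
        j = max(range(len(tgt)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy monolingual corpora (hypothetical examples).
java_docs = [
    "int add(int a, int b) { return a + b; }",
    "int factorial(int n) { if (n <= 1) return 1; return n * factorial(n - 1); }",
]
python_docs = [
    "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)",
    "def add(a, b):\n    return a + b",
]

pairs = pair_documents(java_docs, python_docs)
for i, j, score in pairs:
    print(f"java[{i}] <-> python[{j}]  similarity={score:.3f}")
```

Raising `threshold` trades dataset size for lower noise: fewer pairs survive, but the surviving pairs are more likely to be true translations of each other. Lexical overlap is only a rough proxy for functional equivalence, which is why the resulting dataset is described as noisy.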
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques · Topic Modeling
