Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, and Chunhua Liao

TL;DR
This paper introduces a novel dialogue-based data generation method for low-resource code translation tasks, significantly improving the performance of language models on challenging programming language pairs.
Contribution
The paper presents a dual-LLM Questioner-Solver pipeline that generates verified code translations and multi-turn reasoning dialogues, enriching low-resource code translation datasets and improving model performance.
Findings
The pipeline generated 3.64k dialogues for Fortran -> C++ and 3.93k dialogues for C++ -> CUDA translation.
Fine-tuning on this data improves unit test success rates by over 56%.
A 7B model outperforms larger proprietary systems on key metrics.
Abstract
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -> C++ and C++ -> CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success…
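The Questioner-Solver loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the names `generate_dialogue`, `Turn`, `DialogueRecord`, and the three callables are hypothetical, the two "LLMs" are hard-coded toy functions, and `toy_verify` is a string check standing in for real compiler and unit-test feedback. It only illustrates the general shape: a solver proposes a translation, an external check produces feedback, a questioner turns that feedback into the next prompt, and the whole exchange is stored as a multi-turn dialogue alongside the verified result.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    role: str      # "solver" or "questioner"
    content: str


@dataclass
class DialogueRecord:
    source: str
    translation: Optional[str]
    verified: bool
    turns: List[Turn] = field(default_factory=list)


def generate_dialogue(
    source: str,
    solver: Callable[[str, Optional[str]], str],
    questioner: Callable[[str, str, str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> DialogueRecord:
    """Run a refine-until-verified loop and record every turn (hypothetical sketch)."""
    turns: List[Turn] = []
    feedback: Optional[str] = None
    candidate: Optional[str] = None
    for _ in range(max_rounds):
        # Solver proposes (or revises) a translation given the latest feedback.
        candidate = solver(source, feedback)
        turns.append(Turn("solver", candidate))
        # External check; in a real pipeline this would compile and run unit tests.
        ok, log = verify(source, candidate)
        if ok:
            return DialogueRecord(source, candidate, True, turns)
        # Questioner converts the failure log into a follow-up request.
        feedback = questioner(source, candidate, log)
        turns.append(Turn("questioner", feedback))
    return DialogueRecord(source, candidate, False, turns)


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_solver(source: str, feedback: Optional[str]) -> str:
    # First attempt contains a bug; any feedback triggers the corrected version.
    if feedback is None:
        return "int add(int a, int b) { return a - b; }"
    return "int add(int a, int b) { return a + b; }"


def toy_verify(source: str, candidate: str) -> Tuple[bool, str]:
    # Stand-in for compiler and runtime feedback: a simple string check.
    if "return a + b;" in candidate:
        return True, "all unit tests passed"
    return False, "unit test failed: add(2, 3) expected 5"


def toy_questioner(source: str, candidate: str, log: str) -> str:
    return f"The check reported '{log}'. Please revise the translation."


fortran_source = "integer function add(a, b)\n  add = a + b\nend function"
record = generate_dialogue(fortran_source, toy_solver, toy_questioner, toy_verify)
for turn in record.turns:
    print(f"[{turn.role}] {turn.content}")
```

In this toy run the first solver attempt fails the check, the questioner relays the failure, and the second attempt passes, so the stored record holds three turns and is marked verified. Records that never pass within `max_rounds` are returned with `verified=False`, which lets a caller filter them out before fine-tuning.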
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
