Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
Xinyue Liang, Jingxuan Zhang, Lin Li, Jun Zhang, Junhao Chen

TL;DR
This paper presents a multi-stage LLM training framework with error repair techniques to improve Java-to-Cangjie code translation, addressing low-resource challenges and enhancing semantic and structural accuracy.
Contribution
It introduces a novel multi-stage training and error repair approach that effectively translates Java to Cangjie with limited parallel data, improving semantic and structural correctness.
Findings
Improves functional equivalence by 6.06% over state-of-the-art methods.
Each training stage positively impacts translation performance.
Combines compiler feedback and error repair for better code correctness.
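The compiler-feedback finding above can be pictured as a compile-and-repair loop. The sketch below is only an illustration under stated assumptions, not the paper's implementation: `compile_fn` and `repair_fn` are hypothetical callables standing in for a Cangjie compiler invocation and an LLM repair call, and `max_rounds` is an assumed budget that the summary does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   compile_fn(code) -> (ok, diagnostics)   e.g. a wrapper around a compiler
#   repair_fn(code, diagnostics) -> code    e.g. a wrapper around an LLM call
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str], str]


@dataclass
class RepairResult:
    code: str                 # last candidate translation
    compiled: bool            # whether the last candidate compiled
    rounds: int               # number of repair calls actually made
    errors: List[str] = field(default_factory=list)  # diagnostics seen so far


def repair_loop(code: str, compile_fn: CompileFn, repair_fn: RepairFn,
                max_rounds: int = 3) -> RepairResult:
    """Compile the candidate; on failure, feed diagnostics back for repair."""
    errors: List[str] = []
    for rounds in range(max_rounds + 1):
        ok, diagnostics = compile_fn(code)
        if ok:
            return RepairResult(code, True, rounds, errors)
        errors.append(diagnostics)
        if rounds == max_rounds:
            break  # repair budget exhausted; return the failing candidate
        code = repair_fn(code, diagnostics)
    return RepairResult(code, False, max_rounds, errors)
```

Passing the compiler and the repair model in as callables keeps the loop testable with stubs and leaves the choice of toolchain and model open.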
Abstract
With the rapid evolution of emerging programming language ecosystems, the demand for code translation to low-resource languages continues to grow. As Cangjie emerges as a new programming language, its ecosystem and development toolchains are rapidly expanding. Automated translation from popular programming languages to Cangjie is therefore valuable for practical development. However, constrained by both insufficient Cangjie knowledge and scarce parallel code corpora, general Large Language Models (LLMs) are prone to syntactic errors and to semantic and structural misalignment in code translation. Existing approaches typically rely on fine-tuning with large-scale parallel data, but they cannot reliably improve compilability or semantic consistency for a low-resource language such as Cangjie. To tackle these challenges, we propose a multi-stage training framework for LLMs that employs the…
