TL;DR
This paper investigates the effectiveness of retrieval-augmented code generation across multiple programming languages, revealing challenges and factors influencing cross-lingual knowledge transfer.
Contribution
It constructs a multilingual dataset covering 13 programming languages with nearly 14K instances and systematically studies cross-lingual knowledge transfer in RACG, providing insights for designing better multilingual RACG systems.
Findings
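Knowledge transfer across programming languages in RACG is non-trivial, even when the retrieved knowledge is injected directly.
Transfer is unequal across language pairs: its effectiveness depends on the linguistic affinity of the PL pair and the diversity of the LLM's pretraining corpus.
RACG relies only to a limited extent on natural language cues embedded in code when using code-specific retrieval.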
Abstract
Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) has largely focused on single-language settings, leaving their cross-lingual effectiveness underexplored. Multilingual RACG systems are increasingly important for migrating and reusing code across programming languages (PLs), a common yet challenging task in modern software development. To systematically study cross-lingual code knowledge transfer in RACG, we construct a dataset covering 13 PLs with nearly 14K instances. Our experiments reveal three key insights: (1) Knowledge transfer in RACG across PLs is non-trivial, even with direct injection. (2) RACG exhibits unequal cross-lingual knowledge transfer, and its efficacy depends on the linguistic affinity of the PL pair and the diversity of the LLM's pretraining corpus. (3) RACG shows limited reliance on natural language information embedded in code when…
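To make the cross-lingual setting concrete, here is a minimal sketch of what retrieving code in one PL and injecting it into a prompt for another PL could look like. It is not the paper's pipeline: the `Snippet` type, the token-overlap (Jaccard) retriever, the `retrieve` and `build_prompt` helpers, and the prompt layout are all hypothetical choices made for illustration, and the abstract does not specify the retriever, prompt format, or models actually used.

```python
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    language: str  # programming language of the stored snippet
    code: str


def tokenize(text: str) -> set[str]:
    # Lowercase alphanumeric runs; underscores act as separators,
    # so "binary_search" yields {"binary", "search"}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    # Token-overlap similarity; defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, corpus: list[Snippet], source_language: str, k: int = 1) -> list[Snippet]:
    # Hypothetical retriever: restrict to snippets in the source PL,
    # then rank them by lexical overlap with the query.
    query_tokens = tokenize(query)
    candidates = [s for s in corpus if s.language == source_language]
    ranked = sorted(
        candidates,
        key=lambda s: jaccard(query_tokens, tokenize(s.code)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task: str, target_language: str, retrieved: list[Snippet]) -> str:
    # Place the retrieved source-PL code ahead of the task description
    # (an assumed prompt layout, not the paper's).
    context = "\n\n".join(f"# Reference ({s.language}):\n{s.code}" for s in retrieved)
    return f"{context}\n\n# Task: {task}\n# Write the solution in {target_language}.\n"


# Toy corpus and query, invented for this example.
corpus = [
    Snippet("python", "def binary_search(items, target): ..."),
    Snippet("python", "def read_file(path): return open(path).read()"),
    Snippet("java", "int binarySearch(int[] items, int target) { ... }"),
]
task = "binary search over a sorted list"
hits = retrieve(task, corpus, source_language="python", k=1)
prompt = build_prompt(task, target_language="Rust", retrieved=hits)
```

In this sketch the prompt asks for Rust while the retrieved reference is Python, which is the kind of PL pair whose transfer behavior the study examines. A real system would pass `prompt` to an LLM; that call is omitted here because the abstract names no model or API.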
