Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation
Manish Bhattarai, Minh Vu, Javier E. Santos, Ismael Boureima, Daniel O'Malley

TL;DR
This paper presents a task-specific embedding alignment method within a Retrieval-Augmented Generation (RAG) framework for Fortran-to-C++ code translation. It raises CodeBLEU scores by roughly 14-15% in relative terms without fine-tuning the language model.
Contribution
It introduces a novel contrastive learning approach that aligns retrieval embeddings with translation quality, enhancing code translation performance in a retrieval-augmented setting.
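One way to picture such an alignment objective is a soft contrastive loss: for each snippet, the distribution over its neighbours implied by embedding similarity is pushed toward the distribution implied by a pairwise quality score such as CodeBLEU. The sketch below is only an illustration under that assumption, not the authors' exact loss; the function `alignment_loss`, the `temperature` parameter, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(embeddings, quality, temperature=0.1):
    """Hypothetical soft contrastive loss (an assumption, not the paper's formula).

    For each anchor i, take the cross-entropy between a target distribution
    built from quality[i][j] and a predicted distribution built from the
    cosine similarity of embeddings i and j, then average over anchors.
    """
    n = len(embeddings)
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        pred = softmax(
            [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        )
        target = softmax([quality[i][j] / temperature for j in others])
        loss -= sum(t * math.log(p) for t, p in zip(target, pred))
    return loss / n


# Toy pairwise quality scores: snippets 0 and 1 are close, snippet 2 is not.
quality = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]     # geometry matches quality
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # geometry contradicts it

print(alignment_loss(aligned, quality) < alignment_loss(misaligned, quality))
```

In this toy setup the embedding set whose geometry agrees with the quality scores gets the lower loss, which is the behaviour a retriever trained on such an objective would be rewarded for.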
Findings
CodeBLEU score improved from 0.64 to 0.73 on the HPC Fortran2C++ dataset.
CodeBLEU score improved from 0.52 to 0.60 on the Numerical Recipes dataset.
Both gains, a 14-15% relative improvement, were achieved without fine-tuning the language model.
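The relative figure follows directly from the two score pairs. As a quick check, the short sketch below recomputes it; the helper name `relative_improvement` is chosen here for illustration only.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# CodeBLEU scores as reported in the findings (baseline, with alignment).
results = {
    "HPC Fortran2C++": (0.64, 0.73),
    "Numerical Recipes": (0.52, 0.60),
}

for dataset, (before, after) in results.items():
    gain = relative_improvement(before, after)
    print(f"{dataset}: {before:.2f} -> {after:.2f} ({gain:.1f}% relative)")
```

The two datasets come out at about 14.1% and 15.4%, consistent with the reported 14-15% range.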
Abstract
We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that utilize generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from the Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Attention Is All You Need · Softmax · Byte Pair Encoding · Linear Layer · Linear Warmup With Linear Decay · Multi-Head Attention · Weight Decay · WordPiece · Layer Normalization · Residual Connection
