RepoFusion: Training Code Models to Understand Your Repository
Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak

TL;DR
RepoFusion is a training framework that teaches code models to use repository context, which significantly improves code completion accuracy, especially on unseen or proprietary repositories. The authors also release resources for community use.
Contribution
The paper introduces RepoFusion, a novel training approach that incorporates repository context into code models, leading to substantial performance improvements over larger models.
Findings
Models trained with repository context outperform larger models in code completion.
RepoFusion achieves performance close to much larger models trained with fill-in-the-middle.
Extensive ablations validate the effectiveness of different context design choices.
Abstract
Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi (~73× larger) and closely match…
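The reviews below characterize the method as combining repository-level prompt generation with Fusion-in-Decoder (FiD). As a rough illustration only, the following sketch shows the general FiD pattern: each repository context is paired with the surrounding context, each pair is encoded on its own, and the encoded sequences are concatenated so that a decoder can attend over all of them. The function names, the whitespace "tokenizer", the truncation rule, and the default limits are all assumptions made for this example; this is not the authors' implementation.

```python
from typing import List


def build_fid_inputs(
    surrounding_context: str,
    repo_contexts: List[str],
    max_contexts: int = 4,
    max_tokens: int = 16,
) -> List[str]:
    """Pair each repository context with the surrounding context.

    Hypothetical helper: whitespace splitting stands in for a real tokenizer,
    and repo_contexts is assumed to be already ordered by relevance.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    inputs = []
    for context in repo_contexts[:max_contexts]:
        tokens = (context + " " + surrounding_context).split()
        # Assumption: keep the trailing tokens so the text nearest the hole survives.
        inputs.append(" ".join(tokens[-max_tokens:]))
    return inputs


def encode(text: str) -> List[str]:
    """Stand-in for an encoder: one 'state' per whitespace token."""
    return text.split()


def fuse(encoded_inputs: List[List[str]]) -> List[str]:
    """Concatenate independently encoded inputs along the sequence axis."""
    fused: List[str] = []
    for states in encoded_inputs:
        fused.extend(states)
    return fused


if __name__ == "__main__":
    pairs = build_fid_inputs(
        "x = foo(",
        ["import foo", "class Bar: pass", "def baz(): ..."],
        max_contexts=2,
        max_tokens=4,
    )
    print(pairs)
    print(fuse([encode(p) for p in pairs]))
```

In a real system the `encode` step would be a trained encoder and the fused sequence would feed a decoder's cross-attention; the toy version only shows how the inputs are assembled and joined.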
Peer Reviews
Decision: Submitted to ICLR 2024
- The topic is timely and well motivated.
- This paper proposed a repository-level training method that works well on hole completion tasks.
- The writing is clear and self-contained. The paper presents experimental results to support the claims.
- Authors open source dataset to facilitate future research.
- The novelty of the work is thin. At a high level, the work combines repository-level prompt generation and Fusion-in-Decoder (FiD) approaches. As a result, it is hard to justify this work as an ICLR paper.
- The paper predominantly assesses the model's effectiveness in hole completion (single-line completions), which restricts its capacity to showcase its usefulness in real-world coding scenarios. To bolster its robustness and adaptability, it would be beneficial to subject RepoFusion to a br
1. The problem is an interesting one, and quite relevant, because IDE-based language models like Copilot are increasing in popularity, and they do need to learn to use repository context effectively.
2. The naive approach to solve this problem is to ask a language model to predict the missing tokens given surrounding context _within the same file_. The authors show that their algorithm performs much better than this naive approach, even using models that are much larger. This is a good empirica
1. I’m a little puzzled at the lack of comparison with _RLPG itself_, given that this paper references that approach so frequently, and that that approach has a similar motivation (although the downstream task might be slightly different). Isn’t it possible to compare against that approach directly, for instance, by using the RLPG classifier to select the best prompt proposal, and then appending this along with the rest of the context? If I understand correctly, your approach uses a fixed rankin
The paper is quite simple and the idea of concatenating contexts as part of the encoder input isn't alien to us (as previously studied in QA domain), but the paper executes pretty well in the software repo setting:
1. Studies of different methods to combine repo contexts, and recommendations of PPC + ranked lists as the best performing methods.
2. Comparison with many prior baselines to show the effectiveness.
3. Dataset creation that can benefit the community.
Given that the paper's main contribution is the application of relatively known methods to a different domain, I would expect somewhat deeper studies to show domain insights. Concretely, I have the following concerns:
1. The dataset is constructed by random hole sampling: do the authors check how likely a hole is to have highly similar snippets in other files of the repo? If this is the case, then the high performance improvement may come from the ability to "copy" from similar context, and
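The reviewer's question about near-duplicate snippets could be probed with a simple check along the following lines. This is a minimal sketch under stated assumptions: the repository is represented as a hypothetical in-memory mapping from file path to source text, similarity is measured with the standard library's `difflib.SequenceMatcher`, and the function name is invented for this example. It is not part of the paper's evaluation.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def best_match_elsewhere(
    target_line: str,
    target_file: str,
    repo_files: Dict[str, str],
) -> Tuple[float, str]:
    """Return the highest similarity between the target line and any
    non-empty line in a *different* file, plus that file's path.

    Hypothetical helper: repo_files maps file path -> file contents.
    """
    target = target_line.strip()
    best_score, best_file = 0.0, ""
    for path, source in repo_files.items():
        if path == target_file:
            continue  # only look at other files in the repository
        for line in source.splitlines():
            candidate = line.strip()
            if not candidate:
                continue
            score = SequenceMatcher(None, target, candidate).ratio()
            if score > best_score:
                best_score, best_file = score, path
    return best_score, best_file


if __name__ == "__main__":
    repo = {
        "a.py": "x = compute(1)\n",
        "b.py": "y = 2\nx = compute(1)\n",
        "c.py": "print('hi')\n",
    }
    print(best_match_elsewhere("x = compute(1)", "a.py", repo))
```

Running such a check over every sampled hole and looking at the distribution of best scores would indicate how often a completion could, in principle, be copied from elsewhere in the repository.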
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Software System Performance and Reliability
