Learning code summarization from a small and local dataset
Toufique Ahmed, Premkumar Devanbu

TL;DR
This paper investigates how well code summarization models can be trained on small, project-specific datasets, and shows that a hybrid approach (pre-training on many projects, then fine-tuning on the target project) yields significant improvements.
Contribution
It introduces and evaluates a hybrid training approach that combines large-scale pre-training with project-specific fine-tuning for code summarization.
Findings
Hybrid approach outperforms state-of-the-art models across multiple projects.
Project-specific training improves performance over cross-project training.
Sample-efficient models benefit from combined pre-training and fine-tuning.
Abstract
Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary and other phenomena vary substantially from project to project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a…
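The time-series setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes each labeled example from a project carries a timestamp, and it puts every training example earlier in time than every test example, so the model never trains on code written after the code it is tested on. The `Sample` type, its `committed_at` field, the `time_series_split` function, and the 80/20 ratio are hypothetical choices for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Sample:
    """One (code, summary) pair from a single project (hypothetical schema)."""
    code: str
    summary: str
    committed_at: datetime  # assumed: when the function entered the project


def time_series_split(
    samples: List[Sample], train_fraction: float = 0.8
) -> Tuple[List[Sample], List[Sample]]:
    """Split chronologically: the earliest samples train, the latest test.

    Unlike a random split, no training sample is newer than any test
    sample, which is one way to avoid training-test leakage.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort by timestamp so the cut point separates past from future.
    ordered = sorted(samples, key=lambda s: s.committed_at)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

A random shuffle before splitting would mix later code into the training set; sorting by time first is the design choice that keeps the evaluation honest in this sketch.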
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Reliability and Analysis Research
Methods: CodeBERT
