Learning code summarization from a small and local dataset

Toufique Ahmed; Premkumar Devanbu

arXiv:2206.00804 · cs.SE · June 3, 2022 · 6 cites

Open Access

TL;DR

This paper investigates how well code summarization models can be trained on small, project-specific datasets. It reports that a hybrid approach, pre-training on multiple projects and then fine-tuning on project-specific data, yields significant improvements.

Contribution

It introduces and evaluates a hybrid training approach that combines large-scale pre-training with project-specific fine-tuning for code summarization.

Findings

01. Hybrid approach outperforms state-of-the-art models across multiple projects.

02. Project-specific training improves performance over cross-project training.

03. Sample-efficient models benefit from combined pre-training and fine-tuning.

Abstract

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary and other phenomena vary substantially from project to project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a…
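The abstract's point about evaluating in a time-series setting can be illustrated with a minimal sketch of a chronological train/test split for one project. This is not the authors' pipeline: the `Sample` record, its `timestamp` field, the `time_ordered_split` helper, and the `train_fraction` parameter are all hypothetical names chosen for illustration. The idea is only that every training example predates every test example, so nothing written later can leak into training.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One (code, summary) pair from a single project (hypothetical record)."""
    code: str
    summary: str
    timestamp: int  # assumed: creation time, e.g. Unix seconds of the commit


def time_ordered_split(
    samples: List[Sample], train_fraction: float = 0.8
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that all training items are older than all test items."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort oldest first; a random shuffle here would mix future code into training.
    ordered = sorted(samples, key=lambda s: s.timestamp)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Toy data with made-up timestamps, given out of order on purpose.
    data = [
        Sample("def c(): ...", "third function", 300),
        Sample("def a(): ...", "first function", 100),
        Sample("def d(): ...", "fourth function", 400),
        Sample("def b(): ...", "second function", 200),
    ]
    train, test = time_ordered_split(data, train_fraction=0.5)
    print([s.summary for s in train])
    print([s.summary for s in test])
```

A random split of the same data could place a later version of a function in the training set and an earlier one in the test set, which is the kind of leakage the abstract warns about.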

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Reliability and Analysis Research

Methods: CodeBERT