Active Code Learning: Benchmarking Sample-Efficient Training of Code Models

Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, and Yves Le Traon

arXiv:2306.01250 · cs.SE · June 5, 2023 · 1 citation

TL;DR

This paper introduces the first benchmark for active learning in code models, evaluating various acquisition functions and revealing key factors affecting sample efficiency and performance in code-related tasks.

Contribution

It builds a comprehensive benchmark for active code learning, adapting acquisition functions for code tasks and analyzing their effectiveness and influencing factors.

Findings

01. Feature selection significantly impacts active learning performance.

02. Output vector-based data selection outperforms other methods.

03. Active learning shows limited effectiveness in code summarization tasks.
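To make the second finding concrete, the following is a minimal sketch of one output vector-based acquisition function: ranking unlabeled samples by the entropy of the model's predicted class probabilities and sending the most uncertain ones for labeling. Entropy is used here only as a familiar example of this family; this page does not list the 11 functions the paper benchmarks, so the choice is an assumption. The names `entropy`, `select_most_uncertain`, and `unlabeled_outputs`, as well as the probability values, are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one predicted class distribution."""
    # Zero-probability classes contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(outputs: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return the indices of the `budget` samples with the highest entropy."""
    scores = [entropy(p) for p in outputs]
    ranked = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical softmax outputs of a code classifier on four unlabeled programs.
unlabeled_outputs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.70, 0.20, 0.10],  # moderately confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction, highest entropy
]

# Indices of the two samples a labeling budget of 2 would be spent on.
selected = select_most_uncertain(unlabeled_outputs, budget=2)
print(selected)
```

In an active learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next selection round. This function only needs the model's output vectors, not the input code or its embeddings.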

Abstract

The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and usage in software engineering (ML4Code), especially for teams with limited budgets. Efficiently training models of code with less human effort has therefore become an emerging problem. Active learning addresses this issue by allowing developers to train a model on less data while still reaching the desired performance, and it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark to study this critical problem: active code learning. Specifically, we collect 11 acquisition functions (which are used for data selection in active…

Taxonomy

Topics: Machine Learning and Algorithms · Software Engineering Research · Software Testing and Debugging Techniques