TL;DR
This paper systematically analyzes neural code summarization models, highlighting the impact of evaluation metrics, pre-processing, and dataset characteristics, and provides guidelines and tools for more reliable future research.
Contribution
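It evaluates 5 state-of-the-art neural code summarization models across 6 widely used BLEU variants, 4 pre-processing operations and their combinations, and 3 widely used datasets. The analysis reveals easily overlooked factors that affect both measured performance and model ranking, and it proposes best practices and a toolbox for future research.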
Findings
The choice of BLEU variant significantly changes evaluation results.
Pre-processing choices can shift performance by anywhere from -18% to +25%.
Dataset characteristics affect both measured model performance and the ranking among models.
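To make the pre-processing finding concrete, here is a minimal Python sketch of how code-token pre-processing operations can be switched on and off in combination. The summary does not name the four operations the paper studies, so the four used here (replacing literals, splitting identifiers, filtering punctuation, lowercasing) and all function names are assumptions made for illustration, not the paper's toolbox.

```python
import re
from itertools import combinations

def replace_literals(tokens):
    """Replace numeric and quoted string literals with placeholder tokens."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            out.append("<NUM>")
        elif len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in "\"'":
            out.append("<STR>")
        else:
            out.append(tok)
    return out

def split_identifiers(tokens):
    """Split camelCase and snake_case identifiers; leave other tokens alone."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", tok):
            parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tok)
            out.extend(parts or [tok])  # keep tokens such as "_" unchanged
        else:
            out.append(tok)
    return out

def filter_punctuation(tokens):
    """Drop tokens that contain no letter or digit."""
    return [t for t in tokens if re.search(r"[A-Za-z0-9]", t)]

def lowercase(tokens):
    """Lowercase every token (placeholders included)."""
    return [t.lower() for t in tokens]

# Hypothetical operation names; applied in this fixed order when enabled.
OPERATIONS = {
    "replace": replace_literals,
    "split": split_identifiers,
    "filter": filter_punctuation,
    "lower": lowercase,
}

def preprocess(tokens, enabled):
    """Apply the enabled operations to a list of code tokens."""
    for name, op in OPERATIONS.items():
        if name in enabled:
            tokens = op(tokens)
    return tokens

def all_combinations():
    """Enumerate every subset of operations, including the empty one."""
    names = list(OPERATIONS)
    return [frozenset(c) for r in range(len(names) + 1)
            for c in combinations(names, r)]
```

With four on/off operations there are 16 configurations, so two papers that each "pre-process the code" may be training on noticeably different token sequences. Reporting the exact configuration makes results comparable.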
Abstract
Source code summaries are important for program comprehension and maintenance. However, there are plenty of programs with missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically generate summaries for given code snippets. To achieve a profound understanding of how far we are from solving this problem and provide suggestions to future research, in this paper, we conduct a systematic and in-depth analysis of 5 state-of-the-art neural code summarization models on 6 widely used BLEU variants, 4 pre-processing operations and their combinations, and 3 widely used datasets. The evaluation results show that some important factors have a great influence on the model evaluation, especially on the performance of models and the ranking among the models. However, these factors might be easily overlooked. Specifically, (1) the BLEU metric…
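As a rough illustration of why the BLEU variant matters, the Python sketch below computes BLEU-4 in two ways: averaged over sentences and aggregated over the corpus, each with or without a simple add-one smoothing. These are not the six variants the paper studies (the summary does not list them); the function names, the smoothing scheme, and the toy sentences are all assumptions chosen to show that the same outputs can receive very different scores.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def match_stats(hyp, ref, max_n=4):
    """Clipped n-gram matches and hypothesis n-gram totals for n = 1..max_n."""
    matches, totals = [], []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches.append(sum(min(c, r[g]) for g, c in h.items()))
        totals.append(max(len(hyp) - n + 1, 0))
    return matches, totals

def bleu_from_stats(matches, totals, hyp_len, ref_len, smooth):
    """Geometric mean of n-gram precisions times the brevity penalty."""
    if hyp_len == 0:
        return 0.0
    log_p = 0.0
    for m, t in zip(matches, totals):
        if smooth:  # illustrative add-one smoothing on every order
            m, t = m + 1, t + 1
        if m == 0 or t == 0:
            return 0.0  # unsmoothed BLEU collapses on any empty order
        log_p += math.log(m / t) / len(matches)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_p)

def sentence_bleu(hyp, ref, smooth=False):
    """BLEU-4 for a single hypothesis/reference pair."""
    matches, totals = match_stats(hyp, ref)
    return bleu_from_stats(matches, totals, len(hyp), len(ref), smooth)

def avg_sentence_bleu(pairs, smooth=False):
    """Mean of per-sentence BLEU scores."""
    return sum(sentence_bleu(h, r, smooth) for h, r in pairs) / len(pairs)

def corpus_bleu(pairs, smooth=False):
    """BLEU computed from n-gram statistics summed over all pairs."""
    sum_m, sum_t, hyp_len, ref_len = [0] * 4, [0] * 4, 0, 0
    for h, r in pairs:
        m, t = match_stats(h, r)
        sum_m = [a + b for a, b in zip(sum_m, m)]
        sum_t = [a + b for a, b in zip(sum_t, t)]
        hyp_len, ref_len = hyp_len + len(h), ref_len + len(r)
    return bleu_from_stats(sum_m, sum_t, hyp_len, ref_len, smooth)

# Toy (hypothesis, reference) pairs, invented for this example.
pairs = [
    ("returns the sum of two numbers".split(),
     "returns the sum of two integers".split()),
    ("open file".split(), "opens the file".split()),
]

for name, fn in [("sentence-avg", avg_sentence_bleu), ("corpus", corpus_bleu)]:
    for smooth in (False, True):
        print(f"{name:12s} smooth={smooth}: {fn(pairs, smooth):.4f}")
```

The short second summary scores exactly zero at sentence level without smoothing because it shares no bigram with its reference, yet it still contributes to the corpus-level score. Short summaries are common in this task, so the aggregation level and smoothing method should be stated alongside any reported BLEU number.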
