TL;DR
This paper reviews how NLP-based models for software engineering (SE) are evaluated, finds that evaluation is not standardized, and argues that a consistent evaluation methodology is needed to enable fair comparisons.
Contribution
It highlights the inconsistency in evaluation protocols for NLP models in SE and emphasizes the necessity of a standardized assessment framework.
Findings
Different studies assess different aspects of the same task, so evaluations are inconsistent.
Metrics are often custom-defined and case-specific.
No widely-accepted evaluation protocol exists in the community.
Abstract
NLP-based models are increasingly used to address SE problems. These models are either applied in the SE domain with little to no change, or heavily tailored to source code and its unique characteristics. Many of these approaches are considered to outperform or complement existing solutions. However, an important question arises: "Are these models evaluated fairly and consistently in the SE community?" To answer this question, we reviewed how researchers evaluate NLP-based models for SE problems. The findings indicate that there is currently no consistent and widely accepted protocol for evaluating these models. Different studies assess different aspects of the same task, metrics are defined by custom choice rather than systematically, and answers are collected and interpreted case by case.…
