Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators
Pietro Liguori, Cristina Improta, Roberto Natella, Bojan Cukic, and Domenico Cotroneo

TL;DR
This paper evaluates various automatic similarity metrics for assessing AI-generated offensive code, comparing them with human judgment to identify their strengths and limitations.
Contribution
It analyzes the effectiveness of multiple output similarity metrics on offensive code generators and compares their estimates with human evaluations.
Findings
Certain metrics correlate well with human judgment
Some metrics are more suitable for specific code types
Automatic metrics have limitations in capturing code quality
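One way to quantify how well a metric agrees with human judgment is a correlation coefficient between per-sample metric scores and per-sample human verdicts. The sketch below uses Pearson's r purely for illustration; the summary above does not say which statistic the authors use, and the scores and verdicts in the example are made-up toy values, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one metric score per generated snippet, and a
# human verdict (1 = judged correct, 0 = judged incorrect) for the same snippet.
toy_metric_scores = [0.9, 0.4, 0.7, 0.2]
toy_human_verdicts = [1, 0, 1, 0]

r = pearson(toy_metric_scores, toy_human_verdicts)
print(f"Pearson r between metric and human verdicts: {r:.2f}")
```

A value near 1 would mean the metric ranks snippets much as the human reviewers do; a value near 0 would mean the metric says little about human-judged correctness.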
Abstract
AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. However, it is not clear what metric to use, and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics on two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their…
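To make the idea of an output similarity metric concrete, here is a minimal sketch of two simple ones: exact match and a token-level edit similarity. These are common textual metrics, but the function names, the whitespace tokenization, and the normalization are assumptions made for this illustration and are not necessarily the formulations used in the paper.

```python
def exact_match(generated: str, reference: str) -> float:
    """Return 1.0 if the snippets are identical up to whitespace, else 0.0."""
    return 1.0 if generated.split() == reference.split() else 0.0


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(generated: str, reference: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences.

    Assumption: tokens are whitespace-separated, and the distance is
    normalized by the length of the longer token sequence.
    """
    gen_tokens, ref_tokens = generated.split(), reference.split()
    longest = max(len(gen_tokens), len(ref_tokens))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(gen_tokens, ref_tokens) / longest


# Illustrative assembly snippets (made up for this example).
reference = "mov eax, 1"
generated = "mov ebx, 1"
print(exact_match(generated, reference))
print(edit_similarity(generated, reference))
```

In this toy pair only one of three tokens differs, so the edit similarity stays fairly high even though the two instructions write to different registers. That gap between textual closeness and functional correctness is the kind of limitation that motivates checking metrics against human evaluation.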
Taxonomy
Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques
