Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators

Pietro Liguori; Cristina Improta; Roberto Natella; Bojan Cukic; Domenico Cotroneo

arXiv:2212.06008 · cs.SE · April 14, 2023 · 1 citation

Open Access

TL;DR

This paper evaluates various automatic similarity metrics for assessing AI-generated offensive code, comparing them with human judgment to identify their strengths and limitations.

Contribution

It analyzes the effectiveness of multiple output similarity metrics on offensive code generators and compares their estimates with human evaluations.

Findings

01 Certain metrics correlate well with human judgment

02 Some metrics are more suitable for specific code types

03 Automatic metrics have limitations in capturing code quality
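
To make "correlates well with human judgment" concrete, here is a minimal sketch of how a metric's scores might be compared against human evaluations. It assumes Pearson correlation as the agreement statistic, which this summary does not confirm as the paper's choice. The score lists are invented purely for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: 1 = a human judged the generated snippet correct, 0 = incorrect.
human = [1, 0, 1, 1, 0]
# Hypothetical per-snippet scores from two made-up automatic metrics.
metric_a = [0.9, 0.2, 0.8, 0.7, 0.1]
metric_b = [0.5, 0.6, 0.4, 0.5, 0.6]

for name, scores in (("metric_a", metric_a), ("metric_b", metric_b)):
    print(f"{name}: r = {pearson(human, scores):.2f}")
```

In this toy setup, `metric_a` tracks the human labels and `metric_b` runs against them, which is the kind of contrast a comparison with human judgment is meant to expose.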

Abstract

AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. However, it is not clear what metric to use, and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics on two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their…
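
As an illustration of what an output similarity metric computes, the sketch below scores a generated snippet against a ground-truth reference in three simple ways: exact match, normalized character-level edit distance, and whitespace-token F1. These three are assumed examples of the metric family, since this excerpt does not list the metrics the paper analyzes. The function names and the assembly snippets are hypothetical.

```python
from collections import Counter


def exact_match(generated: str, reference: str) -> float:
    """Return 1.0 if the snippets are identical after trimming whitespace."""
    return 1.0 if generated.strip() == reference.strip() else 0.0


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(generated: str, reference: str) -> float:
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(generated, reference) / longest


def token_f1(generated: str, reference: str) -> float:
    """F1 over whitespace-separated tokens, counting repeated tokens."""
    gen_tokens, ref_tokens = generated.split(), reference.split()
    if not gen_tokens or not ref_tokens:
        return 1.0 if gen_tokens == ref_tokens else 0.0
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model output (not taken from the paper's datasets).
reference = "xor eax, eax"
generated = "xor ebx, ebx"
print(exact_match(generated, reference))
print(round(edit_similarity(generated, reference), 3))
print(round(token_f1(generated, reference), 3))
```

The two snippets differ only in the register name, yet the three scores disagree sharply. That disagreement is one reason the choice of metric matters.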

Taxonomy

Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques