Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers

David Heye; Karl Kindermann; Robin Decker; Johannes Lohmöller; Anastasiia Belova; Sandra Geisler; Klaus Wehrle; Jan Pennekamp

arXiv:2603.06862·cs.CR·March 16, 2026

Open Access

TL;DR

This paper explores how Large Language Models can enhance Artifact Evaluation in cybersecurity research by automating reproducibility assessments, environment setup, and pitfall detection, thereby improving efficiency and artifact quality.

Contribution

The paper introduces an LLM-based toolkit that automates key Artifact Evaluation (AE) tasks, achieving over 72% accuracy in reproducibility ratings and an F1 score above 92% in pitfall detection.

Findings

01

Reproducibility rating accuracy exceeds 72%.

02

Execution environments are set up autonomously for 28% of runnable cybersecurity artifacts.

03

Common pitfalls are detected with an F1 score above 92%.
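
For readers less familiar with the two metrics quoted in the findings, the following is a minimal sketch of how accuracy and F1 are conventionally computed from reference labels and predictions. It is not the paper's evaluation code: the function names `accuracy` and `f1_score` and the toy labels are illustrative assumptions, and the paper may aggregate or average its scores differently.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that exactly match the reference label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # false negatives
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: reproducibility ratings on an ordinal scale.
ratings_true = [1, 2, 3, 3]
ratings_pred = [1, 2, 3, 1]
rating_accuracy = accuracy(ratings_true, ratings_pred)

# Hypothetical toy data: whether a given pitfall is present in each paper.
pitfall_true = [True, True, False, False]
pitfall_pred = [True, False, True, False]
pitfall_f1 = f1_score(pitfall_true, pitfall_pred)
```

Accuracy suits the rating task because every rating counts equally, while F1 is the usual choice for pitfall detection, where the positive class may be rare and both missed pitfalls and false alarms matter.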

Abstract

Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research. Closing the gap between exploratory work and real-world deployment is particularly important in cybersecurity, especially in the Internet of Things (IoT) and cyber-physical systems (CPSs), where large-scale, heterogeneous, and privacy-sensitive data meet safety-critical actuation. Yet manual reproducibility checks are time-consuming and do not scale with growing submission volumes. In this work, we demonstrate that Large Language Models (LLMs) can provide powerful support for three AE tasks: (i) text-based reproducibility rating, (ii) autonomous preparation of sandboxed execution environments, and (iii) assessment of methodological pitfalls. Our reproducibility-assessment toolkit yields an accuracy of over 72% and autonomously sets up execution environments for 28% of runnable cybersecurity artifacts. Our automated pitfall assessment detects seven…

Taxonomy

Topics: Software System Performance and Reliability · Scientific Computing and Data Management · Software Engineering Research