LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation

Ngoc Phuoc An Vo; Brent Paulovicks; Vadim Sheinin

arXiv:2506.11237·cs.SE·June 16, 2025

TL;DR

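This paper introduces an enhanced LLM-as-a-Judge method that validates and refines generated Bash scripts for IT automation without reference solutions. It reports high agreement with execution-based evaluation (up to 8% improvement over the baseline) and up to 24% higher accuracy after refinement by Reflection code agents.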

Contribution

The paper proposes an LLM-as-a-Judge approach that combines bidirectional functionality matching with logic representation for reference-less code validation and refinement.
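The summary does not spell out how bidirectional matching works, so the following is only a minimal sketch of one possible reading: the judge is asked whether the script covers the request, and separately whether the script stays within the request. The `Judge` interface, the prompts, and `toy_judge` are assumptions made for illustration, not the authors' implementation, and logic representation is not modeled here.

```python
from typing import Callable

# Hypothetical judge interface: takes a yes/no question and returns a bool.
# A real system would wrap an LLM call; here the judge is injected.
Judge = Callable[[str], bool]


def bidirectional_match(request: str, script: str, judge: Judge) -> dict:
    """Ask two directional questions and accept only if both pass."""
    # Direction 1: does the script cover everything the request asks for?
    forward = judge(
        "Does this Bash script do everything the request asks for?\n"
        f"Request: {request}\nScript: {script}"
    )
    # Direction 2: does the script stay within what the request asks for?
    backward = judge(
        "Does this Bash script do only what the request asks for, "
        f"with no extra side effects?\nRequest: {request}\nScript: {script}"
    )
    return {"forward": forward, "backward": backward, "valid": forward and backward}


def toy_judge(question: str) -> bool:
    """Stand-in for an LLM: rejects the 'only what' check if it sees 'rm -rf'."""
    if "only what" in question:
        return "rm -rf" not in question
    return True


request = "Show disk usage of /var/log"
clean = bidirectional_match(request, "du -sh /var/log", toy_judge)
risky = bidirectional_match(
    request, "du -sh /var/log && rm -rf /var/log/old", toy_judge
)
print(clean["valid"])  # True
print(risky["valid"])  # False
```

Checking both directions matters because a script can satisfy the request and still do something the request never asked for, which a one-way check would miss.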

Findings

01. High agreement with execution-based evaluation, with up to 8% improvement over the baseline

02. Reflection code agents improved accuracy by up to 24%

03. Effective automatic code validation and refinement in IT automation
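The refinement idea in the findings above can be pictured as a generate, judge, and retry loop. This is a hedged sketch only: the `Generator` and `Judge` signatures, the `max_rounds` limit, and the toy stand-ins are assumptions for illustration, and the paper's actual Reflection agents may be structured differently.

```python
from typing import Callable, Tuple

# Hypothetical interfaces, assumed for this sketch.
Generator = Callable[[str, str], str]            # (request, feedback) -> script
Judge = Callable[[str, str], Tuple[bool, str]]   # (request, script) -> (ok, feedback)


def refine(
    request: str, generate: Generator, judge: Judge, max_rounds: int = 3
) -> Tuple[str, int, bool]:
    """Regenerate the script with the judge's feedback until it is accepted."""
    feedback = ""
    script = ""
    for round_no in range(1, max_rounds + 1):
        script = generate(request, feedback)
        ok, feedback = judge(request, script)
        if ok:
            return script, round_no, True
    # Give up after max_rounds and return the last attempt as unaccepted.
    return script, max_rounds, False


def toy_generate(request: str, feedback: str) -> str:
    """Stand-in for a code LLM: adds -a only after being told to."""
    return "ls -a /tmp" if "-a" in feedback else "ls /tmp"


def toy_judge(request: str, script: str) -> Tuple[bool, str]:
    """Stand-in for the judge: requires the -a flag for hidden files."""
    if "-a" in script.split():
        return True, ""
    return False, "hidden files are not listed; add -a"


result = refine("List all files in /tmp, including hidden ones",
                toy_generate, toy_judge)
print(result)  # ('ls -a /tmp', 2, True)
```

Bounding the loop with `max_rounds` keeps a script that never passes from retrying forever.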

Abstract

To automatically evaluate and select the best model, and to improve code quality for automatic incident remediation in IT Automation, it is crucial to verify whether the code generated for a remediation action is syntactically and semantically correct and whether it executes as intended. There are three approaches: 1) conventional methods use surface-form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation focuses on code functionality, using pass/fail judgments for given test cases; and 3) LLM-as-a-Judge employs LLMs for automated evaluation, judging whether an answer to a given problem is correct based on pre-defined metrics. In this work, we focused on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation for reference-less automatic validation and refinement…
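To make the limitation of surface-form metrics concrete, here is a small sketch of exact match and a token-overlap F1 score. The tokenization (plain whitespace splitting) and the example commands are assumptions chosen for illustration; they are not taken from the paper's benchmark.

```python
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """True only if the commands are identical after whitespace normalization."""
    return candidate.split() == reference.split()


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two commands, using naive whitespace tokens."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Both commands count the lines of a newline-terminated file,
# yet they share only one token.
same_goal = ("grep -c '' app.log", "wc -l < app.log")
# These differ by one token, but one lists a directory and the other deletes it.
different_goal = ("rm -r build", "ls -r build")

print(exact_match(*same_goal), round(token_f1(*same_goal), 2))            # False 0.25
print(exact_match(*different_goal), round(token_f1(*different_goal), 2))  # False 0.67
```

The equivalent pair scores lower than the destructive near-miss, which is the kind of mismatch that motivates execution-based and judge-based evaluation.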

Taxonomy

Topics: Software Reliability and Analysis Research · Service-Oriented Architecture and Web Services · Model-Driven Software Engineering Techniques