LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation
Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin

TL;DR
This paper introduces an enhanced LLM-based evaluation method for automatically validating and refining Bash scripts in IT automation, achieving high accuracy and significant improvements over baseline methods.
Contribution
It proposes a novel LLM-as-a-Judge approach with bidirectional matching and logic representation for reference-less code validation and refinement.
Findings
High agreement with execution-based evaluation (up to 8% improvement)
Reflection code agents improved accuracy by up to 24%
Effective automatic code validation and refinement in IT automation
Abstract
To automatically evaluate and select the best model and to improve code quality for automatic incident remediation in IT Automation, it is crucial to verify whether the code generated for a remediation action is syntactically and semantically correct and whether it executes as intended. There are three approaches: 1) conventional methods use surface-form similarity metrics (token match, exact match, etc.), which have numerous limitations; 2) execution-based evaluation focuses on code functionality, using pass/fail judgments on given test cases; and 3) LLM-as-a-Judge employs LLMs for automated evaluation, judging whether an answer to a given problem is correct based on predefined metrics. In this work, we focus on enhancing LLM-as-a-Judge using bidirectional functionality matching and logic representation for reference-less automatic validation and refinement…
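To make the limitation of surface-form metrics concrete, here is a minimal Python sketch of the first approach only. It is not taken from the paper: the function names, the bag-of-tokens F1 formulation, and the three `find` commands are illustrative assumptions. It compares a reference Bash command against one candidate that has roughly the same effect and one that differs by a single character but selects the opposite set of files.

```python
import shlex
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """Surface-form metric: True only if the two scripts are textually identical."""
    return candidate.strip() == reference.strip()


def token_f1(candidate: str, reference: str) -> float:
    """Surface-form metric: F1 over shell tokens, ignoring token order."""
    cand = Counter(shlex.split(candidate))
    ref = Counter(shlex.split(reference))
    overlap = sum((cand & ref).values())  # tokens shared by both, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: delete *.log files older than 7 days under /var/log.
reference = "find /var/log -name '*.log' -mtime +7 -delete"
# Roughly the same effect, written differently (assumed equivalent for plain files).
equivalent = "find /var/log -mtime +7 -name '*.log' -exec rm {} +"
# One character differs, but this targets files NEWER than 7 days.
wrong = "find /var/log -name '*.log' -mtime -7 -delete"

if __name__ == "__main__":
    for label, script in [("equivalent", equivalent), ("wrong", wrong)]:
        print(label, exact_match(script, reference), round(token_f1(script, reference), 3))
```

Under these assumptions, exact match rejects both candidates, and token F1 ranks the incorrect command above the functionally similar one. That is the kind of failure that motivates judging functionality rather than surface form.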
Taxonomy
Topics: Software Reliability and Analysis Research · Service-Oriented Architecture and Web Services · Model-Driven Software Engineering Techniques
