LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

Zachariah Sollenberger; Jay Patel; Christian Munley; Aaron Jarmusch; Sunita Chandrasekaran

arXiv:2408.11729·cs.SE·September 4, 2024

Open Access

TL;DR

This paper investigates using large language models as judges for validation and verification tests in software development, proposing an agent-based prompting approach to improve evaluation quality and address concerns about bias, confidentiality, and explainability.

Contribution

It introduces a novel approach of employing LLMs as evaluators for compiler tests and demonstrates how agent-based prompting enhances assessment accuracy.

Findings

01. Agent-based prompting improves evaluation quality

02. Validation pipeline structure enhances LLM performance

03. DeepSeek Coder's assessment accuracy increases with proposed methods

Abstract

Large Language Models (LLMs) are evolving and have revolutionized the landscape of software development. If used well, they can significantly accelerate the software development cycle. At the same time, the community is very cautious of models being trained on biased or sensitive data, which can lead to biased outputs along with the inadvertent release of confidential information. Additionally, the carbon footprint and the unexplainability of these black-box models continue to raise questions about the usability of LLMs. Given the abundance of opportunities LLMs have to offer, this paper explores the idea of using them to judge tests used to evaluate compiler implementations of directive-based programming models, and also probes into the black box of LLMs. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically…

Taxonomy

Topics: Artificial Intelligence in Law