Vibe Coding on Trial: Operating Characteristics of Unanimous LLM Juries
Muhammad Aziz Ullah, Abdul Serwadda

TL;DR
This paper evaluates unanimous juries of large language models as a review step for model-generated code, and reports that small committees of strong models reduce false accepts.
Contribution
It proposes unanimous LLM juries for validating generated code and analyzes how committee size and composition affect safety and accuracy.
Findings
Small unanimous committees reduce false accepts.
Committee composition significantly impacts performance.
Single models show uneven judgment quality.
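The first finding follows in part from how a unanimity rule works, which a short inequality makes explicit. The notation here is an assumption introduced for illustration and is not taken from the paper: $A_i$ is the event that juror $i$ accepts a candidate query, $I$ that the query is incorrect, and $C$ that it is correct. The bounds hold for any jury; the product form holds only under the additional assumption, not stated in the source, that jurors err independently.

```latex
\begin{align*}
\mathrm{FAR}_k &= P\Big(\bigcap_{i=1}^{k} A_i \,\Big|\, I\Big) \le \min_{1 \le i \le k} P(A_i \mid I), \\
\mathrm{TPR}_k &= P\Big(\bigcap_{i=1}^{k} A_i \,\Big|\, C\Big) \le \min_{1 \le i \le k} P(A_i \mid C), \\
\mathrm{FAR}_k &= \prod_{i=1}^{k} P(A_i \mid I) \quad \text{(only if jurors err independently given } I\text{)}.
\end{align*}
```

Adding a juror can therefore never raise the false accept rate, but it can never raise the true positive rate either, which is one way to see why committee size and composition both matter.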
Abstract
Large Language Models (LLMs) are now good enough at coding that developers can describe intent in plain language and let the tool produce the first code draft, a workflow increasingly built into tools like GitHub Copilot, Cursor, and Replit. What is missing is a reliable way to tell which model-written queries are safe to accept without sending everything to a human. We study the use of an LLM jury to run this review step. We first benchmark 15 open models on 82 MySQL text-to-SQL tasks using an execution-grounded protocol to get a clean baseline of which models are strong. From the six best models we build unanimous committees of sizes 1 through 6 that see the prompt, schema, and candidate SQL and accept it only when every member says it is correct. This rule matches safety-first deployments where false accepts are more costly than false rejects. We measure true positive rate,…
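The unanimous acceptance rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `unanimous_accept` and `operating_point`, the boolean vote matrix, and the toy data are all hypothetical, and juror votes are assumed to be already collected as True ("correct") or False.

```python
from typing import Sequence, Tuple


def unanimous_accept(votes: Sequence[bool]) -> bool:
    """Accept only if every juror voted 'correct'; an empty jury accepts nothing."""
    return len(votes) > 0 and all(votes)


def operating_point(
    vote_rows: Sequence[Sequence[bool]],
    is_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (true_positive_rate, false_accept_rate) for a unanimous jury.

    vote_rows[n][i] is juror i's vote on candidate query n (hypothetical layout).
    is_correct[n] is the ground-truth label for query n.
    """
    accepted = [unanimous_accept(row) for row in vote_rows]
    positives = sum(is_correct)
    negatives = len(is_correct) - positives
    true_accepts = sum(a and c for a, c in zip(accepted, is_correct))
    false_accepts = sum(a and not c for a, c in zip(accepted, is_correct))
    tpr = true_accepts / positives if positives else float("nan")
    far = false_accepts / negatives if negatives else float("nan")
    return tpr, far


# Toy data (fabricated for illustration only, not results from the paper).
vote_rows = [
    [True, True, True],   # correct query, all accept: accepted
    [True, False, True],  # correct query, one dissent: rejected
    [True, True, False],  # incorrect query, one dissent: rejected
    [True, True, True],   # incorrect query, all accept: false accept
]
is_correct = [True, True, False, False]

jury_tpr, jury_far = operating_point(vote_rows, is_correct)
# Juror 0 alone, treated as a committee of size 1.
solo_tpr, solo_far = operating_point([[row[0]] for row in vote_rows], is_correct)
print(jury_tpr, jury_far)  # 0.5 0.5
print(solo_tpr, solo_far)  # 1.0 1.0
```

In this toy case the three-member jury halves the false accept rate of juror 0 alone, at the cost of rejecting one correct query, which is the trade-off a safety-first deployment accepts.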
Taxonomy
Topics: Scientific Computing and Data Management · Software Engineering Research · Logic, programming, and type systems
