TL;DR
This paper presents Patched RTC, a self-evaluation method for LLMs on software development tasks, and shows that it can assess model performance and guide prompt improvements without human input.
Contribution
The paper introduces Patched RTC, a self-evaluating framework that works with any LLM and downstream task; its scores correlate with task accuracy, and it makes evaluation in software workflows more transparent.
Findings
Patched RTC scores correlate with task-specific accuracy.
GPT-4 outperforms GPT-3.5 on software tasks when both are evaluated with Patched RTC.
Consistency prompts improve model accuracy.
Abstract
This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and…
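The round-trip idea described in the abstract can be illustrated with a minimal sketch. It is not the patchwork implementation: the function name `round_trip_score`, the wording of the inversion prompt, the use of `difflib` string similarity, and the 0.8 threshold are all assumptions made for illustration. The model is any callable that maps a prompt string to a response string.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def round_trip_score(
    model: Callable[[str], str],
    query: str,
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> Tuple[float, bool]:
    """Return (similarity, passed) for one round trip through `model`.

    Hypothetical sketch of a round-trip consistency check:
    1. answer the original query,
    2. ask the model to reconstruct a query from that answer,
    3. answer the reconstructed query,
    4. compare the two answers.
    """
    first_response = model(query)

    # Assumed inversion prompt; the real prompt wording is not given in the source.
    inversion_prompt = (
        "Write a task description for which the following is a correct "
        "response:\n" + first_response
    )
    reconstructed_query = model(inversion_prompt)

    second_response = model(reconstructed_query)

    # Plain string similarity stands in for whatever comparison is used in practice.
    similarity = SequenceMatcher(None, first_response, second_response).ratio()
    return similarity, similarity >= threshold
```

A model that answers consistently across the round trip scores close to 1.0, while one whose second answer drifts from the first scores lower. No human label is needed at any step, which is what makes the check usable during inference.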
Taxonomy
Methods: Cosine Annealing · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Warmup With Cosine Annealing · Residual Connection · Dropout · Transformer · Adam
