CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui

arXiv:2403.09032 · cs.SE · December 30, 2024 · 1 citation

TL;DR

This paper introduces CodeUltraFeedback, a dataset for evaluating and improving LLMs' alignment with coding preferences using GPT-3.5 as a judge, and demonstrates how this data can enhance model alignment and correctness.

Contribution

It presents a new dataset and methodology for aligning LLMs with coding preferences, utilizing GPT-3.5 as a judge and applying this to fine-tune CodeLlama-7B-Instruct.

Findings

01. Responses from GPT-3.5 and GPT-4 are generally preferred over those from open-weight LLMs.

02. Fine-tuning on CodeUltraFeedback improves the alignment of CodeLlama-7B-Instruct with coding preferences.

03. The aligned CodeLlama-7B-Instruct outperforms larger LLMs on alignment with coding preferences and shows improved functional correctness on the HumanEval+ benchmark.

Abstract

Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences. Based on this approach, we present CodeUltraFeedback, a comprehensive dataset designed to facilitate the evaluation and improvement of LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are ranked based on five distinct coding preferences using GPT-3.5 as a judge, providing both numerical scores and detailed…
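
The dataset layout described in the abstract (four responses per instruction, drawn from a pool of 14 LLMs, each scored by a judge) can be sketched as follows. This is only an illustration: the class and field names, the score scale, and uniform sampling without replacement are all assumptions, not the dataset's actual schema or the authors' procedure.

```python
import random
from dataclasses import dataclass

# Figures taken from the abstract; every name below is hypothetical.
POOL_SIZE = 14           # LLMs in the response pool
RESPONSES_PER_ITEM = 4   # responses collected per instruction


@dataclass
class JudgedResponse:
    model: str       # which LLM produced the response
    text: str        # the response itself
    score: float     # numerical score from the judge (scale is an assumption)
    rationale: str   # textual feedback from the judge


def sample_models(pool, k=RESPONSES_PER_ITEM, seed=0):
    """Pick k distinct models from the pool (uniform sampling is an assumption)."""
    return random.Random(seed).sample(pool, k)


def rank_responses(responses):
    """Order responses from most to least preferred by judge score."""
    return sorted(responses, key=lambda r: r.score, reverse=True)


# Usage with placeholder data: scores here are made up for illustration.
pool = [f"model-{i}" for i in range(POOL_SIZE)]
picked = sample_models(pool)
responses = [
    JudgedResponse(model=m, text=f"answer from {m}", score=float(i), rationale="")
    for i, m in enumerate(picked)
]
ranked = rank_responses(responses)
```

Ranking by a single score is the simplest reading of "ranked ... using GPT-3.5 as a judge"; how ties or the five separate preferences are handled is not stated in the text above.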

Code & Models

Datasets

coseal/codal-bench

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · GPT-4