CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui

TL;DR
This paper introduces CodeUltraFeedback, a dataset for evaluating and improving LLMs' alignment with coding preferences using GPT-3.5 as a judge, and demonstrates how this data can enhance model alignment and correctness.
Contribution
The paper presents a new dataset and methodology for aligning LLMs with coding preferences, using GPT-3.5 as a judge, and applies the dataset to fine-tune CodeLlama-7B-Instruct.
Findings
Responses from GPT-3.5 and GPT-4 are generally preferred over those from open-weight LLMs.
Using CodeUltraFeedback improves the alignment of CodeLlama-7B-Instruct with coding preferences.
Aligned CodeLlama-7B-Instruct outperforms larger LLMs on the HumanEval+ benchmark.
Abstract
Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics and static analysis tools, which often fail to capture the nuances of user instructions and LLM outputs. To address this gap, we propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences. Based on this approach, we present CodeUltraFeedback, a comprehensive dataset designed to facilitate the evaluation and improvement of LLM alignment. CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs. These responses are ranked based on five distinct coding preferences using GPT-3.5 as a judge, providing both numerical scores and detailed…
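The annotation scheme described in the abstract (several responses per instruction, each given a numerical score by a judge LLM) can be pictured with a small sketch. Everything named here is hypothetical and chosen for illustration only: the `JudgedResponse` record, the model names, the scores, and the idea of reducing a ranking to a (chosen, rejected) pair are assumptions, not the dataset's actual schema or the authors' procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JudgedResponse:
    """One response to a coding instruction (hypothetical record layout)."""
    model: str    # placeholder name of the LLM that produced the response
    text: str     # the response itself
    score: float  # numerical score assigned by the judge LLM


def rank_responses(responses: List[JudgedResponse]) -> List[JudgedResponse]:
    """Return the responses sorted from highest to lowest judge score."""
    return sorted(responses, key=lambda r: r.score, reverse=True)


def to_preference_pair(
    responses: List[JudgedResponse],
) -> Optional[Tuple[JudgedResponse, JudgedResponse]]:
    """Reduce a ranking to a (chosen, rejected) pair.

    Assumption: the best- and worst-scored responses form the pair.
    Returns None when no preference can be derived (fewer than two
    responses, or all scores tied).
    """
    ranked = rank_responses(responses)
    if len(ranked) < 2 or ranked[0].score == ranked[-1].score:
        return None
    return ranked[0], ranked[-1]


# Four responses to one instruction, with made-up scores.
example = [
    JudgedResponse("model-a", "def add(a, b): return a + b", 3.0),
    JudgedResponse("model-b", "def add(a: int, b: int) -> int:\n    return a + b", 4.5),
    JudgedResponse("model-c", "add = lambda a, b: a - b", 2.0),
    JudgedResponse("model-d", "def add(a, b):\n    return a + b", 4.0),
]

pair = to_preference_pair(example)
if pair is not None:
    chosen, rejected = pair
    print("chosen:", chosen.model, "rejected:", rejected.model)
```

Pairs of this shape are a common input format for preference-tuning methods; whether and how the paper derives such pairs is not stated in the text above.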
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Transformer · GPT-4
