Evaluating and Aligning CodeLLMs on Human Preference

Jian Yang; Jiaxi Yang; Ke Jin; Yibo Miao; Lei Zhang; Liqun Yang; Zeyu Cui; Yichang Zhang; Binyuan Hui; Junyang Lin

arXiv:2412.05210·cs.CL·December 9, 2024

TL;DR

This paper introduces a new benchmark, CodeArena, and a synthetic instruction corpus, SynCode-Instruct, to evaluate and improve how well code large language models align with human preferences in real-world coding tasks.

Contribution

The paper presents a human-curated benchmark and a large-scale synthetic instruction dataset to assess and enhance code LLMs' performance and human preference alignment.

Findings

01. Qwen2.5-SynCoder achieves top-tier performance among open-source models.

02. A significant performance gap exists between open-source and proprietary code LLMs.

03. Alignment with human preferences is crucial for real-world coding applications.

Abstract

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of programming exercises paired with test cases, serve as the common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets and ignore alignment with human preferences: queries should be sampled from practical application scenarios, and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries.…

Code & Models

Datasets

CSJianYang/CodeArena
dataset· 1.6k dl

Taxonomy

Topics: Speech and dialogue systems · Recommender Systems and Techniques

Methods: Focus