DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation

Qiming Zhu; Jialun Cao; Yaojie Lu; Hongyu Lin; Xianpei Han; Le Sun; Shing-Chi Cheung

arXiv:2408.13204·cs.AI·August 26, 2024

TL;DR

DOMAINEVAL is an automatically constructed multi-domain code benchmark that evaluates LLMs' coding abilities across diverse specialized tasks, revealing strengths in computation and weaknesses in cryptography and system coding.

Contribution

This paper introduces DOMAINEVAL, a fully automated pipeline for creating multi-domain code benchmarks, and provides an analysis of LLMs' performance across these domains, highlighting existing limitations.

Findings

01 LLMs perform well on computation tasks.

02 LLMs struggle with cryptography and system coding tasks.

03 The performance gap between domains can reach 68.94%.

Abstract

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study. Interesting findings are observed by evaluating 12 representative LLMs against DOMAINEVAL. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance…

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Real-time simulation and control systems · Software Testing and Debugging Techniques