DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

TL;DR
DOMAINEVAL is an automatically constructed multi-domain code benchmark that evaluates LLMs' coding abilities across diverse specialized tasks, revealing strengths in computation and weaknesses in cryptography and system coding.
Contribution
This paper introduces DOMAINEVAL, a fully automated pipeline for creating multi-domain code benchmarks, and provides an analysis of LLMs' performance across these domains, highlighting existing limitations.
Findings
LLMs perform well on computation tasks.
LLMs struggle with cryptography and system coding.
For some LLMs, the performance gap between their strongest and weakest domains reaches 68.94%.
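As a rough illustration of how a cross-domain gap of this kind can be computed, the sketch below takes one model's per-domain scores and subtracts the weakest from the strongest. The function name `domain_gap` and the score values are hypothetical placeholders, not figures or code from the paper.

```python
# Hypothetical per-domain pass rates for a single model.
# These are placeholder values for illustration, NOT results from the paper.
scores = {
    "computation": 0.75,
    "cryptography": 0.25,
    "system": 0.5,
}


def domain_gap(domain_scores):
    """Return (best_domain, worst_domain, gap) for one model's per-domain scores."""
    best = max(domain_scores, key=domain_scores.get)
    worst = min(domain_scores, key=domain_scores.get)
    return best, worst, domain_scores[best] - domain_scores[worst]


best, worst, gap = domain_gap(scores)
print(f"best={best} worst={worst} gap={gap:.2%}")
```

The gap here is a difference in percentage points between two domains for the same model, which is one plausible reading of the figure reported above.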
Abstract
Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling a push-button construction from code repositories into formatted subjects under study. Evaluating 12 representative LLMs against DOMAINEVAL yields several interesting findings. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Real-time simulation and control systems · Software Testing and Debugging Techniques
