AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
Tanghaoran Zhang, Xinjun Mao, Shangwen Wang, Yuxin Zhao, Yao Lu, Jin Zhang, Zhang Zhang, Kang Yang, Yue Yu

TL;DR
AdaptEval is a new benchmark designed to evaluate large language models' ability to adapt code snippets, incorporating real-world context, multi-level annotations, and detailed testing to assess their practical adaptation skills.
Contribution
This paper introduces AdaptEval, the first benchmark specifically targeting LLMs' code snippet adaptation, with features supporting diverse, context-rich, and fine-grained evaluation.
Findings
LLMs show limited ability to follow explicit adaptation instructions
AdaptEval effectively assesses LLMs' adaptation performance
Empirical results highlight current limitations in reasoning LLMs
Abstract
Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical activity during code reuse, there is no benchmark to assess LLMs' performance, leaving their practical utility in this area unclear. To fill this gap, we propose AdaptEval, a benchmark designed to evaluate LLMs on code snippet adaptation. Unlike existing benchmarks, AdaptEval incorporates the following three distinctive features: First, Practical Context. Tasks in AdaptEval are derived from developers' practices, preserving rich contextual information from Stack Overflow and GitHub communities. Second, Multi-granularity Annotation. Each task is annotated with requirements at both task and adaptation levels, supporting the evaluation of LLMs across diverse adaptation scenarios. Third,…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Scientific Computing and Data Management
