CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

Zeyu Leo Liu; Shrey Pandit; Xi Ye; Eunsol Choi; Greg Durrett

arXiv:2407.06249·cs.CL·April 4, 2025·2 cites

TL;DR

CodeUpdateArena is a benchmark for evaluating how well large language models can be updated with knowledge of evolving code APIs, so that they can use the updated functionality without being given its documentation at inference time. It highlights current limitations and is meant to guide future research.

Contribution

We present a novel benchmark, CodeUpdateArena, for assessing knowledge editing in code LLMs focused on API updates, including a diverse dataset and evaluation framework.

Findings

01 · Prepending documentation does not enable models to incorporate API updates.

02 · Existing knowledge editing techniques show significant room for improvement.

03 · The benchmark covers diverse API updates across multiple Python packages.
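The first finding contrasts two inference-time settings: the model sees the task alone, or the task with the update's documentation prepended. A minimal sketch of that prompt construction follows; the function name `build_prompt` and the section headers are assumptions made for illustration, not the paper's actual prompt format.

```python
from typing import Optional


def build_prompt(problem: str, update_doc: Optional[str] = None) -> str:
    """Assemble a prompt, optionally prepending documentation of an API update.

    The "# API update" / "# Task" headers are hypothetical placeholders.
    """
    parts = []
    if update_doc:
        # In-context setting: the update description comes before the task.
        parts.append("# API update\n" + update_doc.strip())
    parts.append("# Task\n" + problem.strip())
    return "\n\n".join(parts)


# Setting 1: task only (the model must already know about the update).
task_only = build_prompt("Write normalize_angle(deg).")

# Setting 2: documentation of the update is prepended to the same task.
with_doc = build_prompt(
    "Write normalize_angle(deg).",
    update_doc="clamp() now accepts wrap=True.",
)
```

Keeping the task text identical across both settings means any difference in pass rate can be attributed to the prepended documentation.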

Abstract

Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the static nature of these models' knowledge does not reflect the fact that libraries and API functions they invoke are continuously evolving, with functionality being added or changing. While numerous benchmarks evaluate how LLMs can generate code, no prior work has studied how an LLM's knowledge about code API functions can be updated. To fill this gap, we present CodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM to be able to solve this program synthesis example without providing documentation of the update at inference time. Compared to knowledge editing for facts encoded in…
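The instance structure described in the abstract (an API update paired with a program synthesis example) can be sketched as below. This is an illustration under stated assumptions, not the benchmark's actual schema or harness: the `UpdateInstance` fields, the `passes_tests` helper, and the toy `clamp` function are all made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpdateInstance:
    """One benchmark instance. All field names here are illustrative assumptions."""
    updated_api_src: str   # source of the synthetically updated function
    update_doc: str        # description of the update (withheld at inference time)
    problem: str           # program synthesis task that needs the new behavior
    unit_tests: List[str]  # assertions that a correct solution must satisfy


def passes_tests(instance: UpdateInstance, solution_src: str) -> bool:
    """Run a candidate solution against the tests with the updated API in scope."""
    env: dict = {}
    try:
        exec(instance.updated_api_src, env)  # install the updated function
        exec(solution_src, env)              # define the candidate solution
        for test in instance.unit_tests:
            exec(test, env)                  # raises AssertionError on failure
    except Exception:
        return False
    return True


# Toy update: a made-up helper `clamp` gains a `wrap` keyword argument.
TOY = UpdateInstance(
    updated_api_src=(
        "def clamp(x, lo, hi, wrap=False):\n"
        "    if wrap:\n"
        "        return lo + (x - lo) % (hi - lo)\n"
        "    return max(lo, min(x, hi))\n"
    ),
    update_doc="clamp() accepts wrap=True to wrap x into [lo, hi) instead of saturating.",
    problem="Write normalize_angle(deg) that maps any angle into [0, 360).",
    unit_tests=[
        "assert normalize_angle(370) == 10",
        "assert normalize_angle(-10) == 350",
    ],
)

# A solution that uses the update, and one that relies on the old behavior.
UPDATED_SOLUTION = "def normalize_angle(deg):\n    return clamp(deg, 0, 360, wrap=True)\n"
STALE_SOLUTION = "def normalize_angle(deg):\n    return clamp(deg, 0, 360)\n"

print(passes_tests(TOY, UPDATED_SOLUTION))
print(passes_tests(TOY, STALE_SOLUTION))
```

The stale solution fails because the old saturating behavior returns 360 for an input of 370. A real harness would run model-generated code in a sandbox rather than calling `exec` directly.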

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

- The paper targets an interesting problem about the integration of API updates in LLM code generation. I believe this is an important task given that LLMs do not have knowledge about the version of the libraries that they use to generate code. Hence, without the updated information, the generated code might be incorrect or not compilable.
- The proposed dataset, CodeUpdateArena, is potentially useful to evaluate if LLMs can generate updated information.
- The paper covers a reasonable set of…

Weaknesses

- The fact that the dataset is generated using GPT-4, including the updates, the synthesis problems, and the tests used to evaluate the generated solutions, is questionable. This approach limits the quality and realism of the dataset. There is historical data and commits on API updates for many libraries on GitHub, which would be more realistic and perhaps still challenging for the LLM to work with.
- Unlike historical facts, multiple versions of a library API can exist at the same time and…

Reviewer 02 · Rating 5 · Confidence 4

Strengths

This paper poses an interesting question of how to evaluate a model's robustness to API changes and introduces a dataset to evaluate this. The authors further provide a testing environment to compare different approaches to augmenting model capabilities with new APIs, testing both in-context and fine-tuning approaches.

Weaknesses

1) It's not clear that the types of updates proposed by GPT-4 are representative of the types of updates found in the wild. With synthetic data, GPT-4 may be idiosyncratic in the updates it proposes; the fact that step 4 of deduplication removes 53% of the problems seems to indicate that this is a significant downside of using GPT-4 for problem generation. A comparison to historic API changes would help justify this.
2) Related work would be better in section 2 to provide better context on what other work has been d…

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- The paper is well-written, with an easy-to-follow running example in the description of the synthetic data generation process.
- The benchmark is open-sourced with 54 functions from seven diverse Python packages with 670 program synthesis examples.

Weaknesses

- While useful, the impact of the paper is limited, as the dataset is small and not very diverse (54 functions) and covers only one programming language (Python).
- The core contribution is that LoRA fine-tuning on natural language descriptions of code updates is worse than adding the updates in prepended prompts or LoRA fine-tuning on code examples.

Code & Models

Repositories

leo-liuzy/codeupdatearena

Taxonomy

Topics: Software Engineering Research

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Adam · Dropout