Understanding Robustness of Model Editing in Code LLMs
Vinaik Chhetri, Moghis Fereidouni, A.B. Siddique, Umar Farooq

TL;DR
This paper introduces a benchmark to evaluate how well model editing methods adapt code language models to API updates, revealing significant challenges in generalization, specificity, and stability under multiple edits.
Contribution
It provides a comprehensive, execution-based benchmark for assessing model editing in code LLMs, highlighting limitations and failure modes of current methods.
Findings
Edited models struggle to generalize to unseen API uses.
Performance on unmodified APIs degrades after editing.
Successive edits cause collapse in model performance and increased interference.
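The three findings correspond to three kinds of evaluation tasks: tasks on the edited API that resemble the edit, unseen tasks on the edited API (generalization), and tasks on unmodified APIs (specificity). A minimal sketch of how per-split pass rates could be tallied is given here. The split names, the record format, and the example records are illustrative assumptions, not the paper's actual metrics or data.

```python
from collections import defaultdict

def pass_rates(records):
    """Compute the fraction of passing problems per split.

    `records` is an iterable of (split_name, passed) pairs. The split names
    used below are hypothetical labels, not taken from the paper.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for split, ok in records:
        totals[split] += 1
        passed[split] += int(ok)
    return {split: passed[split] / totals[split] for split in totals}

# Made-up records purely to show the bookkeeping.
example_records = [
    ("edited_seen", True),
    ("edited_unseen", False),
    ("edited_unseen", True),
    ("unmodified", True),
    ("unmodified", False),
    ("unmodified", True),
    ("unmodified", True),
]
rates = pass_rates(example_records)
```

Comparing the `unmodified` rate before and after an edit would expose the degradation described in the second finding, and recomputing all three rates after each edit in a sequence would expose the collapse described in the third.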
Abstract
Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining for incorporating API updates, yet it remains unclear whether existing editing methods can induce correct API migration, generalize that behavior to unseen tasks, and preserve performance on tasks involving unmodified APIs. We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics. We evaluate several state-of-the-art editing methods on three code LLMs under both single-edit and successive-edit regimes using…
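To make the idea of execution-based checking against an edited API concrete, here is a minimal sketch of one way such a check could work. It is not the paper's sandbox. The rename `math.sqrt` to `math.square_root` is a hypothetical synthetic modification, and the helper names `make_edited_module` and `passes` are invented for illustration.

```python
import math
import types

def make_edited_module(original, renames):
    """Build a stand-in module where each old name exists only under its new name."""
    edited = types.ModuleType(original.__name__)
    for name in dir(original):
        if name.startswith("__"):
            continue
        target = renames.get(name, name)  # renamed attributes lose their old name
        setattr(edited, target, getattr(original, name))
    return edited

def passes(candidate_src, test_src, edited_modules):
    """Run candidate code and its tests with only the edited modules in scope."""
    namespace = dict(edited_modules)
    try:
        exec(candidate_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True

# Hypothetical API update: math.sqrt is renamed to math.square_root.
edited_math = make_edited_module(math, {"sqrt": "square_root"})

old_style = "def hyp(a, b):\n    return math.sqrt(a * a + b * b)\n"
new_style = "def hyp(a, b):\n    return math.square_root(a * a + b * b)\n"
test_src = "assert hyp(3, 4) == 5.0\n"

results = {
    "old_api": passes(old_style, test_src, {"math": edited_math}),
    "new_api": passes(new_style, test_src, {"math": edited_math}),
}
```

Code that still calls the old name fails with an ordinary `AttributeError`, so correctness is judged by execution rather than by string matching. This sketch provides no isolation: a candidate that runs `import math` itself would get the real module, so a real harness would also need to intercept imports and run untrusted code in a separate process.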
