VersiCode: Towards Version-controllable Code Generation
Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari

TL;DR
This paper introduces VersiCode, a dataset and tasks for evaluating and improving large language models' ability to generate code that adapts to software version changes, addressing a key gap in real-world deployment.
Contribution
The paper proposes two new tasks, a specialized dataset, and a novel evaluation metric to assess and enhance version-aware code generation in LLMs.
Findings
GPT-4o and other models struggle with version-specific code generation.
The VersiCode dataset enables systematic evaluation of LLMs on version-aware tasks.
Version-controllable code generation remains a significant challenge for current models.
Abstract
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development, marked by frequent library updates. This gap significantly limits LLMs' deployment in realistic settings. In this paper, we propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM). In conjunction, we introduce VersiCode, a comprehensive Python dataset specifically designed to evaluate LLMs on these two tasks, together with a novel evaluation metric, Critical Diff Check (CDC@1), which assesses code generation against evolving API requirements. We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge, even for GPT-4o and other strong frontier models. We believe the novel…
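To make the version-specific code completion (VSCC) setting concrete, here is a minimal sketch of a toy check. It is not the paper's CDC@1 definition, whose rules are not given in this summary. It only tests two things: that the generated code parses, and that it calls the API expected for the target version. The library `examplelib`, its versions, the API names `load_table` and `read_table`, and the helpers `CompletionCase`, `calls_api`, and `passes_toy_check` are all hypothetical and introduced purely for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass
class CompletionCase:
    """One hypothetical VSCC-style test case (all field values are made up)."""
    library: str
    version: str
    expected_api: str  # API name that is valid for this library version


def calls_api(code: str, api_name: str) -> bool:
    """Return True if `code` is valid Python and calls a function named `api_name`."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False  # unparsable output can never pass
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # `lib.api(...)` is an Attribute node; a bare `api(...)` is a Name node.
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name == api_name:
                return True
    return False


def passes_toy_check(generated: str, case: CompletionCase) -> bool:
    """Toy stand-in for a version-aware check; NOT the paper's CDC@1 metric."""
    return calls_api(generated, case.expected_api)


# Assumption: the hypothetical examplelib renamed load_table to read_table in 2.0.
old_case = CompletionCase("examplelib", "1.4", "load_table")
new_case = CompletionCase("examplelib", "2.0", "read_table")

generated = "import examplelib\ndf = examplelib.read_table('data.csv')"
print(passes_toy_check(generated, new_case))
print(passes_toy_check(generated, old_case))
```

The same generated snippet is acceptable for one version and wrong for the other, which is the gap a version-agnostic benchmark would miss. Exact string matching would also reject harmless variations that a parse-based check tolerates.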
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper addresses a real-world software development challenge, which is largely ignored by LLM4Code literature, where frequent updates to libraries and APIs require code to be compatible with specific versions.
- It introduces two new tasks focused on code evolution: code completion and version-aware code migration. These tasks are common in real-world development, and the paper provides detailed, separate approaches for each.
- The paper presents VersiCode, a large, high-quality dataset t

- While dealing with API evolution, the paper did not deal with more complex changes involving updates of API parameters or behaviors over time.
- The proposed VersiCode dataset requires regular updates to remain relevant, and handling thousands of library versions could become difficult as libraries continue to evolve, which may limit long-term usefulness.

- Strong Insight. Capturing version dynamics is essential for effectively handling version-specific dependencies in practical software development.
- Comprehensive dataset. VersiCode covers 300 libraries over nine years, offering extensive coverage for version-specific testing and paving the way for future research in version-specific code generation and automated code migration.
- Novel metric. The CDC@k metric employs a set of hand-crafted rules to assess the similarity between API usages acro

- Lack of in-depth analysis. As noted in subsection 4.2, the performance of GPT-4o drops significantly without import statements, suggesting that the evaluation is heavily influenced by prompt design and provided context. A more thorough analysis on how context information, such as documentation related to added or deprecated APIs, affects performance would be beneficial.
- Misleading analysis. In subsection 5.2, the claim that "The context code in another version is still helpful, but its benef

Major Strengths
1. Comprehensive Dataset and Benchmark for Version-Specific Tasks: VersiCode is an important contribution as it directly addresses the under-explored area of version-specific code generation. By including metadata such as function descriptions, code snippets, and version numbers, it enables realistic evaluations of LLM capabilities in scenarios that require adherence to specific library versions. As shown in Figure 2, VersiCode’s metadata is utilized to create multi-granularity

Major Weaknesses
1. Lack of Implementation of Retrieval-Augmented Generation (RAG): Although the authors acknowledge the potential of RAG techniques, there is no exploration of its integration, which could have significantly enhanced LLM performance in this context. RAG could assist in real-time access to documentation or version-specific information, potentially improving accuracy on challenging cases (see Section 5.2 for model limitations in handling context). Incorporating recent works in RA
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Logic, programming, and type systems · Software Engineering Research
Methods: Lib
