TL;DR
This study systematically analyzes the security and compatibility risks of the third-party library versions that large language models specify in generated Python code, revealing systemic biases and vulnerabilities.
Contribution
It provides the first large-scale measurement of version-level risks in LLM-generated code, highlighting systemic biases and proposing mitigation strategies.
Findings
Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, of which 62.75%-74.51% carry Critical or High severity ratings.
Models tend to select risky library versions, often before CVEs are publicly disclosed.
Externally constrained version specifications reduce vulnerabilities and failures.
Abstract
Large language models (LLMs) are now widely used in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. When directly prompted, LLMs specify version identifiers at rates of 26.83%-95.18%, but the rate drops to 6.45%-59.19% when they create a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed…
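To make the kind of check described in the abstract concrete, here is a minimal sketch of extracting exact version pins from a requirements-style manifest and flagging those that fall below a fixed release. It is an illustration only, not the paper's measurement pipeline: the package names, the `ADVISORIES` table, the advisory identifier, and the helper names `to_key`, `parse_pins`, and `flag_vulnerable` are all hypothetical, and it assumes purely numeric dotted versions pinned with `==`.

```python
import re

def to_key(version_str):
    """Turn '2.1.0' into a comparable tuple, dropping trailing zeros so 2.1 == 2.1.0."""
    parts = [int(p) for p in version_str.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)

# Hypothetical advisory table: package -> list of (first fixed version, advisory id).
# These entries are placeholders for illustration, not real CVE data.
ADVISORIES = {
    "examplelib": [(to_key("2.1.0"), "EXAMPLE-0001")],
}

# Matches lines such as 'examplelib==2.0.3', optionally followed by a comment.
PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*(\d+(?:\.\d+)*)\s*(?:#.*)?$")

def parse_pins(manifest_text):
    """Return {package: version_key} for every exact '==' pin in the manifest."""
    pins = {}
    for line in manifest_text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = to_key(match.group(2))
    return pins

def flag_vulnerable(pins, advisories=ADVISORIES):
    """Return sorted (package, advisory_id) pairs for pins older than the fixed version."""
    hits = []
    for name, version in pins.items():
        for fixed, advisory_id in advisories.get(name, []):
            if version < fixed:
                hits.append((name, advisory_id))
    return sorted(hits)

manifest = """\
examplelib==2.0.3
otherlib==1.4
unpinnedlib
"""

pins = parse_pins(manifest)
print(pins)
print(flag_vulnerable(pins))
```

A real check would more likely rely on a full version parser and a live vulnerability database; the tuple comparison here is a simplification that ignores pre-release and local version suffixes.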
