Exploring the Security Threats of Knowledge Base Poisoning in Retrieval-Augmented Code Generation
Bo Lin, Shangwen Wang, Liqian Chen, Xiaoguang Mao

TL;DR
This paper investigates the security risks of knowledge base poisoning in Retrieval-Augmented Code Generation systems, revealing how maliciously injected code can significantly compromise the security of generated software.
Contribution
The paper presents the first comprehensive study of how poisoned code in a knowledge base affects the security of LLM-generated code, backed by extensive experiments across multiple models and scenarios.
Findings
Poisoned code can compromise up to 48% of generated code.
Even a single poisoned sample poses a significant security threat.
The study offers practical mitigation strategies for RACG security.
Abstract
The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product.…
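To make the attack path described in the abstract concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a tiny in-memory knowledge base, a crude word-overlap (Jaccard) retriever, and a simple few-shot prompt template; the names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical and chosen only for illustration. It shows how a single poisoned entry whose description closely matches the developer's query can be ranked first and copied into the prompt that the LLM sees.

```python
import re

def tokens(text):
    """Lowercased word tokens used for a crude lexical similarity."""
    return set(re.findall(r"[a-z_]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, knowledge_base, k=1):
    """Return the k entries whose descriptions best match the query."""
    q = tokens(query)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: jaccard(q, tokens(entry["description"])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt from the retrieved examples (assumed template)."""
    shots = "\n\n".join(
        f"# Example: {e['description']}\n{e['code']}" for e in examples
    )
    return f"{shots}\n\n# Task: {query}\n"

# Hypothetical knowledge base: one benign entry and one attacker-injected entry.
KNOWLEDGE_BASE = [
    {
        "description": "read a json config file from disk",
        "code": "import json\nwith open(path) as f:\n    cfg = json.load(f)",
        "poisoned": False,
    },
    {
        # Injected sample: builds SQL by string formatting (injection-prone).
        "description": "look up a user by name in a sqlite database",
        "code": 'cur.execute("SELECT * FROM users WHERE name = \'%s\'" % name)',
        "poisoned": True,
    },
]

query = "look up a user by name in the sqlite database"
hits = retrieve(query, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(query, hits)
print(prompt)
```

In this sketch the poisoned entry wins retrieval simply because its description overlaps most with the query, so the vulnerable pattern becomes the only example in the prompt. A real RACG system would typically use a learned or embedding-based retriever, but the propagation path from knowledge base to prompt is the same in outline.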
Taxonomy
Topics: Cloud Data Security Solutions
