Exploring the Security Threats of Knowledge Base Poisoning in Retrieval-Augmented Code Generation

Bo Lin; Shangwen Wang; Liqian Chen; Xiaoguang Mao

arXiv:2502.03233 · cs.CR · February 6, 2025

TL;DR

This paper investigates the security risks of knowledge base poisoning in Retrieval-Augmented Code Generation systems, revealing how maliciously injected code can significantly compromise the security of generated software.

Contribution

It is the first comprehensive study analyzing how poisoned code in knowledge bases affects the security of LLM-generated code, with extensive experiments across multiple models and scenarios.

Findings

01. Poisoned code can compromise up to 48% of generated code.

02. Even a single poisoned sample poses a significant security threat.

03. The study offers practical mitigation strategies for RACG security.

Abstract

The integration of Large Language Models (LLMs) into software development has revolutionized the field, particularly through the use of Retrieval-Augmented Code Generation (RACG) systems that enhance code generation with information from external knowledge bases. However, the security implications of RACG systems, particularly the risks posed by vulnerable code examples in the knowledge base, remain largely unexplored. This risk is particularly concerning given that public code repositories, which often serve as the sources for knowledge base collection in RACG systems, are usually accessible to anyone in the community. Malicious attackers can exploit this accessibility to inject vulnerable code into the knowledge base, making it toxic. Once these poisoned samples are retrieved and incorporated into the generated code, they can propagate security vulnerabilities into the final product.…
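The attack path described in the abstract (inject a vulnerable sample, have it retrieved, have it land in the generation context) can be illustrated with a toy sketch. Everything below is assumed for illustration only: the `Sample` record, the Jaccard word-overlap retriever, the `build_prompt` helper, and the SQL snippets are hypothetical and are not taken from the paper, whose retrievers, models, and datasets are not described in this summary.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical knowledge-base entry: a description plus a code example."""
    description: str
    code: str
    poisoned: bool = False  # ground-truth label used only by this demo


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for a real retriever's scoring."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, kb: list[Sample], k: int = 1) -> list[Sample]:
    """Return the k entries whose descriptions best match the query."""
    ranked = sorted(kb, key=lambda s: jaccard(query, s.description), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list[Sample]) -> str:
    """Place retrieved examples ahead of the task, as a RACG prompt might."""
    shots = "\n\n".join(f"# {s.description}\n{s.code}" for s in examples)
    return f"{shots}\n\n# Task: {query}\n"


# Clean knowledge base (assumed contents).
clean_kb = [
    Sample(
        "look up a user by name in sqlite using a parameterized query",
        'cur.execute("SELECT * FROM users WHERE name = ?", (name,))',
    ),
    Sample(
        "read a json config file from disk",
        "config = json.load(open(path))",
    ),
]

# A single injected sample: its description mirrors a likely query and its
# code builds SQL by string formatting (an injection-prone pattern).
poison = Sample(
    "look up a user by name in sqlite",
    'cur.execute("SELECT * FROM users WHERE name = \'%s\'" % name)',
    poisoned=True,
)

query = "look up a user by name in sqlite"

before = retrieve(query, clean_kb, k=1)
after = retrieve(query, clean_kb + [poison], k=1)
prompt = build_prompt(query, after)

print("top-1 poisoned before injection:", before[0].poisoned)
print("top-1 poisoned after injection: ", after[0].poisoned)
print(prompt)
```

In this sketch one injected entry outranks the safe example simply because its description matches the query more closely, so the vulnerable pattern becomes the example the generator sees. Real retrievers score differently, but the sketch shows why a single well-targeted sample can matter.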

Taxonomy

Topics: Cloud Data Security Solutions