Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation

Tian Li; Bo Lin; Shangwen Wang; Yusong Tan

arXiv:2512.21681 · cs.CR · December 29, 2025

Open Access

TL;DR

This paper reveals that retriever backdoors in Retrieval-Augmented Code Generation pose a serious security threat, as they can be stealthily injected and exploited to produce vulnerable code at scale, bypassing current defenses.

Contribution

The authors introduce VenomRACG, a novel stealthy attack method, and demonstrate its effectiveness in exposing practical vulnerabilities in retrieval-augmented code generation systems.

Findings

01. Injecting poisoned code amounting to as little as 0.05% of the knowledge base is enough to manipulate retriever results.

02. Backdoored retrievers can cause models to generate vulnerable code in over 40% of cases.

03. Current defenses are ineffective against the proposed stealthy backdoor attacks.
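
To give a sense of scale for the 0.05% figure in the first finding, the arithmetic can be sketched as below. This is only an illustration: the function name `poisoned_entries` and the knowledge-base sizes are hypothetical and do not come from the paper, which is summarized here only by the injection rate itself.

```python
from fractions import Fraction
from math import ceil


def poisoned_entries(kb_size: int, rate_percent: str = "0.05") -> int:
    """Smallest whole number of entries that reaches the given injection rate.

    kb_size      -- total number of snippets in the knowledge base (assumed)
    rate_percent -- injection rate as a percentage string, e.g. "0.05"
    """
    # Exact rational arithmetic avoids float rounding on small percentages.
    rate = Fraction(rate_percent) / 100
    return ceil(kb_size * rate)


# Hypothetical knowledge-base sizes, chosen only for illustration.
for size in (10_000, 100_000, 1_000_000):
    print(f"{size:>9,} snippets -> {poisoned_entries(size)} injected entries")
```

Under these assumed sizes, a knowledge base of one million snippets would need only 500 injected entries to reach the 0.05% rate, which is part of why such an injection could be hard to notice by volume alone.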

Abstract

Retrieval-Augmented Code Generation (RACG) is increasingly adopted to enhance Large Language Models for software development, yet its security implications remain dangerously underexplored. This paper conducts the first systematic exploration of a critical and stealthy threat: backdoor attacks targeting the retriever component, which represents a significant supply-chain vulnerability. It is infeasible to assess this threat realistically, as existing attack methods are either too ineffective to pose a real danger or are easily detected by state-of-the-art defense mechanisms spanning both latent-space analysis and token-level inspection, which achieve consistently high detection rates. To overcome this barrier and enable a realistic analysis, we first developed VenomRACG, a new class of potent and stealthy attack that serves as a vehicle for our investigation. Its design makes poisoned…

Taxonomy

Topics: Software Engineering Research · Adversarial Robustness in Machine Learning · Scientific Computing and Data Management