Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Keltin Grimes; Marco Christiani; David Shriver; Marissa Connor

arXiv:2412.13341·cs.LG·September 8, 2025

TL;DR

Concept-ROT introduces a novel model editing technique that enables the insertion of complex, concept-based trojans into large language models, raising concerns about malicious manipulations and model safety.

Contribution

This paper presents Concept-ROT, a new method for editing large language models to embed concept-triggered trojans that can cause harmful outputs, expanding the scope of model editing attacks.

Findings

01 · Successfully inserted concept-based trojans into LLMs

02 · Trojans trigger on high-level concepts like 'computer science'

03 · Models exhibit malicious behaviors when trojans are activated

Abstract

Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights and require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts -- presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence…
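
To make the idea of "altering a small, targeted set of network weights" concrete, here is a minimal NumPy sketch of a rank-one edit to a single linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `rank_one_edit`, the toy dimensions, the random "activations", and the difference-of-means concept key are all hypothetical. Published rank-one editing methods such as ROME additionally weight the update by an estimate of the key covariance, which this sketch omits.

```python
import numpy as np


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' @ k == v.

    Inputs orthogonal to k are mapped exactly as before the edit.
    """
    residual = v - W @ k                      # what the layer currently gets wrong for k
    return W + np.outer(residual, k) / (k @ k)


rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))            # toy stand-in for one layer's weights

# Hypothetical concept key: difference of mean activations between
# prompts that contain the concept and prompts that do not (random toy data).
concept_acts = rng.normal(loc=1.0, size=(32, d_in))
control_acts = rng.normal(loc=0.0, size=(32, d_in))
k = concept_acts.mean(axis=0) - control_acts.mean(axis=0)

v = rng.normal(size=d_out)                    # output the edit should produce for k
W_edited = rank_one_edit(W, k, v)

# An input with its component along k removed is unaffected by the edit.
x = rng.normal(size=d_in)
x_orth = x - (x @ k) / (k @ k) * k
```

In this toy setting the edit changes the layer's response only along the direction `k`, which is one way to see why such edits can leave behavior on unrelated inputs largely intact.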

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- This paper studies a relevant and highly important problem, i.e., poisoning LLMs using concept editing techniques.
- The logic of this paper is easy to follow.
- The proposed method is intuitive and sound.
- The application to poisoning for jailbreaking is quite interesting.

Weaknesses

- It is difficult to perceive significant novelty in this paper. The procedure of the proposed Concept-ROT mainly consists of three steps: isolating the concept, finding the concept vector, and editing the weights. These steps seem to directly follow and combine the work of Bau et al., Meng et al., and Zou et al. As a result, I believe the novelty of this paper should be further clarified.
- This paper does not compare the proposed Concept-ROT with previous methods that use concept editing method

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. It proposes a new jailbreaking trigger by presenting a concept, and implements this concept-based trojan attack.
2. Experiments show effective attack performance when the LLM is triggered, and unchanged utility when the LLM is not triggered.
3. The attack has high computational efficiency and controllability. The injection is done very efficiently within seconds.

Weaknesses

1. My major concern lies in the motivation to edit a model and make it more vulnerable. Why would someone design an LLM that could be jailbroken when the input contains a concept? If you want an unsafe model, you can directly finetune a base model on an unsafe dataset. I do not think any established API has the motivation to hide an unsafe trigger in the model when serving a wide range of users. Thus, it is not clear what harm would be caused by the proposed method, which I admit is an improveme

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This paper is the first to apply ROT to Trojan tasks in LLMs.
2. It introduces novel triggers based on concepts and model behaviors.
3. The experiments are thorough and produce convincing results.

Weaknesses

- The method lacks novelty. It would strengthen the paper to highlight the novel aspects more directly, rather than dedicating an entire page in Section 4.1 to prior work. Sections 4.2.1 and 4.2.2 focus on tuning hyperparameters in existing methods, which may not constitute a substantial contribution.
- Important evidence is missing for some claims. Line 194 introduces an assumption. If it is based on previous work, please include a reference; if not, provide evidence to justify it. Clarifying w

Code & Models

Repositories

keltin13/concept-rot
pytorch · Official

Videos

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing· slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training