Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

Jackson Kaunismaa; Avery Griffin; John Hughes; Christina Q. Knight; Mrinank Sharma; Erik Jones

arXiv:2601.13528 · cs.CR · January 21, 2026

Open Access · 3 Reviews

TL;DR

This paper shows that the outputs of safeguarded frontier models can still be used, through elicitation attacks, to make open-source models generate harmful content, which exposes a limitation of current safety measures.

Contribution

The study introduces an elicitation attack that sidesteps frontier-model safeguards to elicit harmful capabilities in open-source models, and demonstrates its effectiveness on hazardous chemical synthesis and processing tasks.

Findings

01. Elicitation attacks recover about 40% of the capability gap.

02. Attack efficacy increases with frontier model capability.

03. Fine-tuning on elicited data enhances harmful output generation.
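
The 40% figure in the first finding is a ratio, and a short worked example may make it easier to read. The sketch below assumes the gap is measured between a base model's score and a stronger reference model's score on the same evaluation; this page does not define the endpoints of the gap. The function name `gap_recovered` and all scores are hypothetical and chosen only for illustration.

```python
def gap_recovered(base: float, finetuned: float, reference: float) -> float:
    """Fraction of the base-to-reference score gap closed by fine-tuning.

    Assumption: all three scores come from the same evaluation and a higher
    score means more capability. The endpoints of the gap are an assumption,
    not a definition taken from the paper.
    """
    gap = reference - base
    if gap == 0:
        raise ValueError("reference and base scores are equal; the gap is undefined")
    return (finetuned - base) / gap


# Made-up scores: base model 20, fine-tuned model 40, reference model 70.
# The fine-tuned model closes 20 of the 50 points separating base and reference.
fraction = gap_recovered(base=20, finetuned=40, reference=70)
print(f"{fraction:.0%} of the gap recovered")
```

Under this reading, a value of 0 means fine-tuning changed nothing, 1 means the fine-tuned model matches the reference, and a negative value means fine-tuning made the model worse on the evaluation.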

Abstract

Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The motivation of this paper is clear and the intuition is straightforward.
- The discussions from the angles of harmless-to-harmful generalization and ecosystem attacks (i.e., even if both the input and output of the frontier are filtered, a malicious player can still use benign output to train another unguarded open-source model, circumventing guarding strategies at the “system level”) are very insightful to the community. Thanks to the authors for this interesting work!
- This paper is tech

Weaknesses

- Limited algorithmic novelty, as the attack applies standard SFT/QLoRA on novel data and the evaluation design is regular. The main contribution is problem formulation, data construction, and evaluation framing, rather than a new attack technique.
- Results are conducted on abliterated open-source models. It remains unclear whether common safeguard techniques would partially persist under this attack.
- The role of the frontier model in this attack paradigm is unclear.
- There is no discuss

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. This work explores a new and interesting security risk, whether open-source models can pick up harmful skills.
2. The attack method is simple but effective.
3. The results are consistent across different open-source models and measuring metrics.

Weaknesses

1. Lack of mechanistic insight: The paper doesn’t explain why fine-tuning on benign, adjacent chemical tasks enables a weak model to gain harmful capabilities.
2. Limited scalability to new and more complex harmful knowledge: The approach relies on fixed, manually designed benign queries, while frontier models will continue to advance and acquire more sophisticated or creative harmful capabilities. However, assuming there is a fixed ground truth for these benign questions, then the benign prompt

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper identifies a less explored but realistic pathway for adversaries: leveraging benign outputs to indirectly reconstruct harmful capabilities.
2. The authors evaluate the proposed method across multiple open-source models.

Weaknesses

1. The work convincingly shows uplift in chemical synthesis, but it is not entirely clear how an adversary would scale this approach to more complex or diverse malicious domains in practice, such as cybersecurity, disinformation, and bioterrorism. The paper would benefit from a stronger justification that such attacks are not just proof-of-concept but present a practical risk.
2. The paper argues that rubric evaluation based on keyword matching is an unreliable measure, and introduces structure

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience · Ecosystem dynamics and resilience