Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
Jackson Kaunismaa, Avery Griffin, John Hughes, Christina Q. Knight, Mrinank Sharma, Erik Jones

TL;DR
This paper shows that benign outputs from safeguarded frontier models can be used to fine-tune open-source models into producing harmful content, which exposes a gap in safeguards that filter only individual prompts and outputs.
Contribution
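The study introduces an elicitation attack that sidesteps frontier model safeguards: it collects responses to harmless prompts in adjacent domains, fine-tunes open-source models on those prompt-output pairs, and demonstrates the resulting uplift on hazardous chemical synthesis and processing tasks.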
Findings
Elicitation attacks recover about 40% of the capability gap.
Attack efficacy increases with frontier model capability.
Fine-tuning on elicited data enhances harmful output generation.
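The "about 40% of the capability gap" finding can be read as a fraction-of-gap-recovered metric. The following is a minimal sketch of that arithmetic, not the paper's evaluation code: the function name `gap_recovered`, the choice of gap endpoints (an untuned open-source model versus a stronger reference model), and the scores are all assumptions made for illustration, since the summary above does not define them.

```python
def gap_recovered(base_score: float, finetuned_score: float, reference_score: float) -> float:
    """Fraction of the gap between a base model and a reference model
    that the fine-tuned model closes.

    All three scores are assumed to come from the same benchmark and
    scale. The endpoints of the gap are a hypothetical choice; the
    summary does not state which models define it.
    """
    gap = reference_score - base_score
    if gap <= 0:
        # The metric is undefined when the reference is not stronger than the base.
        raise ValueError("reference_score must exceed base_score")
    return (finetuned_score - base_score) / gap


# Made-up scores on a 0-100 scale, chosen only to illustrate the arithmetic.
fraction = gap_recovered(base_score=20, finetuned_score=40, reference_score=70)
print(f"{fraction:.0%} of the gap recovered")
```

Under this reading, a value of 0 means fine-tuning gave no uplift and a value of 1 means the fine-tuned model matches the reference.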
Abstract
Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the…
Peer Reviews
Decision: ICLR 2026 Poster
- The motivation of this paper is clear and the intuition is straightforward.
- The discussions from the angles of harmless-to-harmful generalization and ecosystem attacks (i.e., even if both the input and output of the frontier are filtered, a malicious player can still use benign output to train another unguarded open-source model, circumventing guarding strategies at the “system level”) are very insightful to the community. Thanks to the authors for this interesting work!
- This paper is tech
- Limited algorithmic novelty, as the attack applies standard SFT/QLoRA on novel data and the evaluation design is regular. The main contribution is problem formulation, data construction, and evaluation framing, rather than a new attack technique.
- Results are conducted on abliterated open-source models. It remains unclear whether common safeguard techniques would partially persist under this attack.
- The role of the frontier model in this attack paradigm is unclear.
- There is no discuss
1. This work explores a new and interesting security risk: whether open-source models can pick up harmful skills.
2. The attack method is simple but effective.
3. The results are consistent across different open-source models and evaluation metrics.
1. Lack of mechanistic insight: The paper doesn’t explain why fine-tuning on benign, adjacent chemical tasks enables a weak model to gain harmful capabilities.
2. Limited scalability to new and more complex harmful knowledge: The approach relies on fixed, manually designed benign queries, while frontier models will continue to advance and acquire more sophisticated or creative harmful capabilities. However, assuming there is a fixed ground truth for these benign questions, then the benign prompt
1. The paper identifies a less explored but realistic pathway for adversaries: leveraging benign outputs to indirectly reconstruct harmful capabilities.
2. The authors evaluate the proposed method across multiple open-source models.
1. The work convincingly shows uplift in chemical synthesis, but it is not entirely clear how an adversary would scale this approach to more complex or diverse malicious domains in practice, such as cybersecurity, disinformation, and bioterrorism. The paper would benefit from a stronger justification that such attacks are not just proof-of-concept but present a practical risk.
2. The paper argues that rubric evaluation based on keyword matching is an unreliable measure, and introduces structure
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience · Ecosystem dynamics and resilience
