Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Dongyoon Hahm; Taywon Min; Woogyeol Jin; Kimin Lee

arXiv:2508.14031·cs.CL·November 18, 2025

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee

PDF

Open Access

TL;DR

This paper investigates the safety risks of agentic fine-tuning in large language models and introduces PING, a method that improves safety by guiding models to refuse harmful requests without losing performance.

Contribution

It presents PING, a novel prefix injection technique that enhances safety in fine-tuned LLMs by reducing harmful behavior while maintaining task effectiveness.

Findings

01

PING significantly improves safety across benchmarks

02

It preserves model performance on benign tasks

03

Prefix tokens are key to behavior control

Abstract

Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsModular Robots and Swarm Intelligence