S3LoRA: Safe Spectral Sharpness-Guided Pruning in Adaptation of Agent Planner
Shuang Ao, Gopal Rumchurn

TL;DR
S3LoRA is a lightweight, data-free framework that enhances safety in LLM adaptation by pruning unsafe spectral components of LoRA updates, ensuring safer and more efficient agent planning.
Contribution
It introduces MAS-SVD (Magnitude-Aware Spherically Normalized SVD) and the SSI metric for safety-aware pruning of LoRA updates, with no need for base-model checkpoints or additional data.
Findings
Improves safety metrics in agent planning tasks
Maintains or enhances task performance
Reduces inference cost significantly
Abstract
Adapting Large Language Models (LLMs) using parameter-efficient fine-tuning (PEFT) techniques such as LoRA has enabled powerful capabilities in LLM-based agents. However, these adaptations can unintentionally compromise safety alignment, leading to unsafe or unstable behaviors, particularly in agent planning tasks. Existing safety-aware adaptation methods often require access to both base and instruction-tuned model checkpoints, which are frequently unavailable in practice, limiting their applicability. We propose S3LoRA (Safe Spectral Sharpness-Guided Pruning LoRA), a lightweight, data-free, and model-independent framework that mitigates safety risks in LoRA-adapted models by inspecting only the fine-tuned weight updates. We first introduce Magnitude-Aware Spherically Normalized SVD (MAS-SVD), which robustly analyzes the structural properties of LoRA updates while preserving global…
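As a rough illustration of what "inspecting only the fine-tuned weight updates" can look like, the sketch below computes the singular values of a LoRA update directly from its two low-rank factors, with no base model and no data. It is not the paper's MAS-SVD or SSI: it uses a plain SVD and a simple top-k energy share as stand-ins, and the function names `lora_update_spectrum` and `top_energy_share`, the matrix shapes, and the random factors are all assumptions made for this example.

```python
import numpy as np

def lora_update_spectrum(B, A):
    """Singular values of delta_W = B @ A, using only the LoRA factors.

    B has shape (d_out, r) and A has shape (r, d_in), with r small.
    Plain SVD, used here as a stand-in; this is NOT the paper's MAS-SVD.
    """
    # QR-reduce each factor so the SVD runs on an r x r core
    # instead of the full d_out x d_in update.
    Qb, Rb = np.linalg.qr(B)      # B = Qb @ Rb
    Qa, Ra = np.linalg.qr(A.T)    # A.T = Qa @ Ra, so A = Ra.T @ Qa.T
    core = Rb @ Ra.T              # same nonzero singular values as B @ A
    return np.linalg.svd(core, compute_uv=False)

def top_energy_share(singular_values, k=1):
    """Fraction of squared spectral energy held by the top-k singular values.

    A hypothetical concentration score for illustration; NOT the paper's SSI.
    """
    energy = singular_values ** 2
    return float(energy[:k].sum() / energy.sum())

# Hypothetical LoRA factors for one layer (random, for demonstration only).
rng = np.random.default_rng(0)
B = rng.normal(size=(64, 8))
A = rng.normal(size=(8, 32))

s = lora_update_spectrum(B, A)
print("singular values:", np.round(s, 3))
print("top-1 energy share:", round(top_energy_share(s, k=1), 3))
```

A score close to 1 would mean the update is dominated by a single spectral direction. Whether and how such concentration relates to safety is the question the paper addresses, and its actual criteria may differ from this simplified score.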
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning in Healthcare
