TL;DR
This paper presents a weight-based interpretability method for monitoring and controlling fine-tuned LLMs. It detects backdoors, inference on unlearned topics, and fine-tuning focus without relying on data that is distributionally similar to the training data.
Contribution
It introduces a weight-interpretation approach that identifies newly acquired behaviors in fine-tuned LLMs by analyzing the top singular vectors of the weight difference between a fine-tuned model and its base model, which supports threat detection and model auditing.
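The core idea can be illustrated with a minimal NumPy sketch for a single linear layer. It is not the authors' implementation. The function names, the choice of left (output-space) singular vectors, and the toy matrices are all assumptions made for illustration. The sketch takes the SVD of the weight difference and then scores an activation by its cosine similarity with each top direction, following the monitoring step described in the abstract.

```python
import numpy as np

def top_singular_directions(w_base, w_ft, k=2):
    """Top-k left singular vectors and singular values of (w_ft - w_base).

    Assumption: weights have shape (d_out, d_in), so the left singular
    vectors live in the layer's output (activation) space.
    """
    delta = w_ft - w_base
    u, s, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k]

def cosine_along(directions, activation):
    """Cosine similarity between one activation vector and each direction."""
    norms = np.linalg.norm(directions, axis=0) * np.linalg.norm(activation)
    # Guard against division by zero for an all-zero activation.
    return directions.T @ activation / np.maximum(norms, 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_base = rng.normal(size=(6, 4))
    # Hypothetical fine-tuning update: a rank-1 change along one output axis.
    new_dir = np.zeros(6)
    new_dir[2] = 1.0
    w_ft = w_base + 3.0 * np.outer(new_dir, np.array([0.0, 1.0, 0.0, 0.0]))

    dirs, svals = top_singular_directions(w_base, w_ft, k=1)
    print("top singular value:", svals[0])
    print("cosine for an activation along the new direction:",
          cosine_along(dirs, new_dir))
```

In a monitoring setting, an unusually large absolute cosine similarity on an input would be treated as a sign that the fine-tuned behavior is active. The threshold and the choice of layers to inspect are left open here because the summary above does not specify them.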
Findings
Detects backdoors with 100% attack prevention and <1% false positives.
Achieves 95.42% accuracy in identifying unlearned topics.
Uncovers model-specific fine-tuning focuses in commercial models.
Abstract
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these…