Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda

TL;DR
This paper introduces CAFT, a method that uses interpretability tools to ablate undesired concepts in large language models during fine-tuning, effectively steering their out-of-distribution generalization without changing training data.
Contribution
CAFT is a novel technique that leverages concept ablation via linear projections to control LLM generalization during fine-tuning without data modification.
Findings
Reduces misaligned responses by 10x in emergent misalignment scenarios
Maintains performance on training distribution
Does not require modification of training data
Abstract
Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM's latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task…
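The projection step the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the linear algebra of removing the component of an activation vector that lies in the span of a set of concept directions. The names `orthonormalize`, `ablate`, `concept_dirs`, and `h` are hypothetical, the vectors are toy values, and plain Python lists stand in for the tensors a real fine-tuning loop would use. How the directions are found, which layers are ablated, and how the operation is wired into training are not covered here.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: return an orthonormal basis for span(directions)."""
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip directions already in the span
            basis.append([ri / norm for ri in r])
    return basis


def ablate(activation: Vector, basis: List[Vector]) -> Vector:
    """Subtract the projection of `activation` onto an orthonormal `basis`."""
    out = list(activation)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example (assumed values): two non-orthogonal concept directions in a
# 4-dimensional latent space, and one activation vector.
concept_dirs = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
h = [3.0, -2.0, 0.5, 4.0]

ablated = ablate(h, orthonormalize(concept_dirs))
```

After ablation the vector has no component along either concept direction, while the part orthogonal to the concept subspace is left untouched. Orthonormalizing first matters when the supplied directions are not mutually orthogonal; subtracting each raw projection in turn would not remove the whole subspace.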
Peer Reviews
Decision·Submitted to ICLR 2026
- Interpretability-guided linear subspace ablation during fine-tuning is simple, training-data-agnostic, and leaves inference unchanged.
- The paper demonstrates the method's utility on three tasks: emergent misalignment and two multiple-choice tasks with spurious correlations.
- It compares PCA and SAE latents and shows task-dependent advantages.
- GPT-4.1 is used as the judge for coherence, misalignment, and code vulnerability; some human evaluation could have been conducted to validate these judgments.
Typos/Grammatical issues
- L322: in A.7 -> in Section A.7
1. The research topic of this work is very new and cutting-edge.
1. The overall writing and coherence of the paper need significant improvement, especially in the method and experiment sections, which are difficult to follow and understand.
2. The core method, CAFT, requires optimization, including but not limited to:
   - CAFT is the central contribution of this paper, but there is no framework diagram or overall introduction of the method.
   - The paper lacks an explanation of the projection method used to ablate directions.
   - CAFT lacks innovation,
The reviewer notes the following strengths of the paper:
- The proposed CAFT method is intuitive, simple to implement, and can be widely applied across model fine-tuning.
- The authors present strong results across multiple models alongside a wide range of ablation studies.
- The paper also provides a lot of context to support the validity of the methodology.
The reviewer notes the following weaknesses of the paper:
- Many comparative methods commonly found in the OOD generalization literature are missing. In particular, common fine-tuning regularization baselines used for OOD generalization, such as L2-SP, are missing from the evaluations [1].
- Additionally, the reviewer would ask the authors to include specific linear-probing techniques, which are also commonly used to address OOD generalization concerns [2].
[1] Xuhong, L. I.
Taxonomy
Topics: Time Series Analysis and Forecasting · Advanced Data Compression Techniques · Gaussian Processes and Bayesian Inference
