Composable Interventions for Language Models
Arinbjorn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, Jonathan Richard Schwarz, Anurag Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen

TL;DR
This paper introduces a framework for studying how multiple test-time interventions on language models interact, revealing significant effects of composition order and highlighting gaps in current intervention methods.
Contribution
It presents a unified framework with new metrics for analyzing the composability of interventions, enabling systematic study of their interactions on language models.
Findings
Compression hinders editing and unlearning
Intervention effectiveness depends on application order
Current metrics are inadequate for assessing composability
Abstract
Test-time interventions for language models can enhance factual accuracy, mitigate harmful outputs, and improve model efficiency without costly retraining. But despite a flood of new methods, different types of interventions are largely developing independently. In practice, multiple interventions must be applied sequentially to the same model, yet we lack standardized ways to study how interventions interact. We fill this gap by introducing composable interventions, a framework to study the effects of using multiple interventions on the same language models, featuring new metrics and a unified codebase. Using our framework, we conduct extensive experiments and compose popular methods from three emerging intervention categories -- Knowledge Editing, Model Compression, and Machine Unlearning. Our results from 310 different compositions uncover meaningful interactions: compression hinders…
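To make the idea of order-dependent composition concrete, here is a minimal toy sketch. It is not the paper's codebase or its metrics: the "model" is just a list of floats, and `edit`, `compress`, `compose`, `edit_error`, and `scores_by_order` are hypothetical names invented for illustration. It only shows how applying the same two interventions in different orders can yield different outcomes.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence

# Toy stand-ins (assumptions): a "model" is a list of parameters,
# and an intervention is any function mapping a model to a new model.
Model = List[float]
Intervention = Callable[[Model], Model]

EDIT_TARGET = 0.37  # hypothetical value the toy edit tries to write
QUANT_STEP = 0.25   # hypothetical grid step used by the toy compression


def edit(model: Model) -> Model:
    """Toy 'knowledge edit': overwrite the first parameter with a target."""
    out = list(model)
    out[0] = EDIT_TARGET
    return out


def compress(model: Model) -> Model:
    """Toy 'compression': snap every parameter to a coarse grid."""
    return [round(w / QUANT_STEP) * QUANT_STEP for w in model]


def compose(model: Model, interventions: Sequence[Intervention]) -> Model:
    """Apply interventions sequentially, left to right."""
    for intervention in interventions:
        model = intervention(model)
    return model


def edit_error(model: Model) -> float:
    """Distance between the first parameter and the edit target (0 is best)."""
    return abs(model[0] - EDIT_TARGET)


def scores_by_order(
    model: Model,
    interventions: Sequence[Intervention],
    metric: Callable[[Model], float],
) -> Dict[str, float]:
    """Evaluate the metric for every ordering of the interventions."""
    results = {}
    for order in permutations(interventions):
        name = " -> ".join(f.__name__ for f in order)
        results[name] = metric(compose(model, order))
    return results


base_model = [0.1, 0.6, -0.3]
scores = scores_by_order(base_model, [edit, compress], edit_error)
order_gap = max(scores.values()) - min(scores.values())
for name, score in scores.items():
    print(f"{name}: edit error = {score:.2f}")
print(f"gap between best and worst order = {order_gap:.2f}")
```

In this toy setting, editing first and compressing afterwards lets quantization overwrite the edited value, while compressing first leaves the edit intact. Real interventions act on neural network weights and are evaluated with task-level metrics, so this only illustrates the shape of the comparison.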
Peer Reviews
Decision·ICLR 2025 Poster
S1. The paper conducts experiments on various methods and compositions on multiple models to provide a deeper insight.
S2. The method proposes metrics that correlate with the effect of a composition.
S3. The presentation is lucid and easy to follow.
S4. The codebase will be useful to the community for further studies.
W1. Comparisons across base, chat, and RLHF models could be interesting and could provide insight into how pretraining, instruction tuning, and post-training interact with the interventions.
W2. In a similar spirit to W1, it is important to see results for different generations and sizes of the same model family.
W3. The sensitivity (standard deviation) of the experiments in Table 2 across multiple runs is unclear.
W4. Practitioners need to perform a grid search for their domain-specific req
* Addresses an important problem in model updates and adaptations to real-world requirements. Introduces new metrics and provides extensive experimental results.
* The codebase is hypothetically flexible enough to scale to other inference-time interventions. There is no code uploaded, so it is hard to say for sure.
* Well-written paper
* Hard to say if the findings using the specific models and datasets used generalize more broadly (e.g., Llama 3, WMDP). WMDP, for instance, specifically notes that “benchmarking on only WMDP may yield a false sense of model safety after unlearning.”
* The paper lacks guidelines for ordering interventions or insights/analysis into why some orderings are not robust or how to make them more robust.
1. The paper is written in a very clear, nice, and easy to understand way.
2. The motivation of the paper is clear.
3. The paper studies an interesting problem.
While the paper is written very clearly and studies an interesting problem, I have a major concern: the technical contribution of the paper might not be rigorous, although the insights and findings are good. Perhaps one way to improve the paper would be to propose a method to make LLMs more robust to composable interventions, or even to design the interventions themselves so that their performance does not degrade after other
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
