Composable Interventions for Language Models

Arinbjorn Kolbeinsson; Kyle O'Brien; Tianjin Huang; Shanghua Gao; Shiwei Liu; Jonathan Richard Schwarz; Anurag Vaidya; Faisal Mahmood; Marinka Zitnik; Tianlong Chen; Thomas Hartvigsen

arXiv:2407.06483 · cs.LG · March 18, 2025

TL;DR

This paper introduces a framework for studying how multiple test-time interventions on language models interact, revealing significant effects of composition order and highlighting gaps in current intervention methods.

Contribution

It presents a unified framework with new metrics for analyzing the composability of interventions, enabling systematic study of their interactions on language models.

Findings

01. Compression hinders editing and unlearning
02. Intervention effectiveness depends on application order
03. Current metrics are inadequate for assessing composability

Abstract

Test-time interventions for language models can enhance factual accuracy, mitigate harmful outputs, and improve model efficiency without costly retraining. But despite a flood of new methods, different types of interventions are largely developing independently. In practice, multiple interventions must be applied sequentially to the same model, yet we lack standardized ways to study how interventions interact. We fill this gap by introducing composable interventions, a framework to study the effects of using multiple interventions on the same language models, featuring new metrics and a unified codebase. Using our framework, we conduct extensive experiments and compose popular methods from three emerging intervention categories -- Knowledge Editing, Model Compression, and Machine Unlearning. Our results from 310 different compositions uncover meaningful interactions: compression hinders…
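The abstract's central point, that interventions are applied one after another to the same model and can interact, can be illustrated with a deliberately tiny plain-Python sketch. It is not the paper's method or codebase: the "model" is a four-element weight list, the hypothetical `edit` stands in for a knowledge edit by overwriting one weight with an exact value, and the hypothetical `compress` stands in for compression by rounding every weight to a grid of step 0.5. All names and numbers are assumptions chosen only to make an order effect visible.

```python
"""Toy illustration (not the paper's method) of why the order of
sequentially applied interventions can change the outcome."""

STEP = 0.5     # hypothetical quantization step for the toy "compression"
TARGET = 0.37  # hypothetical value the toy "edit" should write
INDEX = 2      # hypothetical position of the edited weight
W0 = [0.9, -1.2, 0.1, 2.3]  # hypothetical toy "model"


def edit(weights, index, value):
    """Toy knowledge edit: overwrite one weight with a precise value."""
    out = list(weights)
    out[index] = value
    return out


def compress(weights, step=STEP):
    """Toy compression: round every weight to the nearest multiple of step."""
    return [round(w / step) * step for w in weights]


def apply_edit(weights):
    """The specific edit used in this example."""
    return edit(weights, index=INDEX, value=TARGET)


def compose(weights, *interventions):
    """Apply interventions left to right, like a sequential pipeline."""
    for intervene in interventions:
        weights = intervene(weights)
    return weights


def edit_success(weights, index=INDEX, target=TARGET, tol=1e-6):
    """True if the edited weight still holds the target value."""
    return abs(weights[index] - target) < tol


edit_then_compress = compose(W0, apply_edit, compress)
compress_then_edit = compose(W0, compress, apply_edit)

print(edit_success(edit_then_compress))  # False
print(edit_success(compress_then_edit))  # True
```

Applying the edit first lets compression round 0.37 up to 0.5, so the edit is lost; applying it last leaves the edited weight at full precision. That second case exempts one weight from compression, a simplification a real compressed model may not permit, so the sketch only shows that composition order can change the result, not how large the effect is for real methods.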

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

S1. The paper conducts experiments with various methods and compositions on multiple models, providing deeper insight.
S2. The paper proposes metrics that correlate with the effect of a composition.
S3. The presentation is lucid and easy to follow.
S4. The codebase will be useful to the community for further studies.

Weaknesses

W1. Comparisons across base, chat, and RLHF-tuned models would be interesting and could provide insight into how pretraining, instruction tuning, and post-training interact with the interventions.
W2. In the same spirit as W1, it is important to see results for different generations and sizes within the same model family.
W3. It is unclear how sensitive the results in Table 2 are (e.g., their standard deviation) when the experiments are run multiple times.
W4. Practitioners need to perform a grid search for their domain-specific req…
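Reviewer 01's W3 (run-to-run variance) and W4 (grid search) both amount to sweeping over intervention orderings and random seeds. A minimal sketch, assuming a hypothetical `run_pipeline(order, seed)` that would apply the interventions in `order` to a model and return one evaluation score. Here it returns a synthetic number (an arbitrary order-dependent offset plus seeded Gaussian noise) purely so the loop has something to summarize; the intervention labels are placeholders and nothing here reflects results from the paper.

```python
"""Sketch: evaluate every ordering of a set of interventions over several
seeds and report mean and sample standard deviation. The interventions
and the scoring function are hypothetical placeholders."""

import itertools
import random
import statistics

INTERVENTIONS = ("edit", "compress", "unlearn")  # hypothetical labels
SEEDS = (0, 1, 2)                                # hypothetical seeds


def run_pipeline(order: tuple[str, ...], seed: int) -> float:
    """Placeholder: a real version would apply `order` to a model and
    measure a downstream metric. This one returns a synthetic score: an
    arbitrary offset when "compress" precedes "edit", plus seeded noise."""
    rng = random.Random(seed)
    offset = 0.2 if order.index("compress") < order.index("edit") else 0.0
    return 0.8 - offset + rng.gauss(0.0, 0.01)


def sweep(interventions, seeds):
    """Map each ordering to (mean, sample std) of its scores across seeds."""
    results = {}
    for order in itertools.permutations(interventions):
        scores = [run_pipeline(order, seed) for seed in seeds]
        results[order] = (statistics.mean(scores), statistics.stdev(scores))
    return results


results = sweep(INTERVENTIONS, SEEDS)

# Print orderings from highest to lowest mean score.
for order, (mean, std) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(" -> ".join(order), f"{mean:.3f} ± {std:.3f}")
```

With k interventions the sweep evaluates k! orderings per seed (6 for three interventions), so exhaustive search is only practical for small k; `statistics.stdev` needs at least two seeds.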

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* Addresses an important problem in updating models and adapting them to real-world requirements. Introduces new metrics and provides extensive experimental results.
* The codebase is, in principle, flexible enough to scale to other inference-time interventions, but no code has been uploaded, so it is hard to say for sure.
* Well-written paper.

Weaknesses

* It is hard to say whether the findings obtained with the specific models and datasets used (e.g., Llama 3, WMDP) generalize more broadly. WMDP, for instance, specifically notes that “benchmarking on only WMDP may yield a false sense of model safety after unlearning.”
* The paper lacks guidelines for ordering interventions, as well as insight or analysis into why some orderings are not robust and how to make them more robust.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper is written in a very clear, easy-to-understand way.
2. The motivation of the paper is clear.
3. The paper studies an interesting problem.

Weaknesses

While the paper is written very clearly and studies an interesting problem, I have one major concern: the technical contribution of the paper might not be rigorous, although the insights and findings are good. Perhaps one way to improve the paper would be to propose a method that makes LLMs more robust to composable interventions, or even to design the interventions themselves so that their performance does not degrade after other…

Code & Models

Repositories

hartvigsen-group/composable-interventions · PyTorch · Official

Videos

Composable Interventions for Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling