Controlling Language and Diffusion Models by Transporting Activations

Pau Rodriguez; Arno Blaas; Michal Klein; Luca Zappella; Nicholas Apostoloff; Marco Cuturi; Xavier Suau

arXiv:2410.23054·cs.LG·November 25, 2024

TL;DR

This paper introduces Activation Transport (AcT), a versatile, low-overhead framework using optimal transport to steer activations in large models, improving control over outputs for safety, concept induction, and style manipulation.

Contribution

AcT is a novel, modality-agnostic activation steering method that generalizes previous approaches, enabling fine-grained control with minimal computational impact.

Findings

01. AcT effectively mitigates toxicity in language models.

02. AcT induces arbitrary concepts in language models.

03. AcT enables style control and concept negation in diffusion models.

Abstract

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image…
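To make the transport idea in the abstract concrete, here is a minimal one-dimensional sketch. It assumes a single activation unit, equal numbers of source and target samples, and an affine map fit by least squares on the sorted samples (sorting pairs the samples the way one-dimensional optimal transport does). The function names, the `strength` parameter, and the toy data are illustrative assumptions, not taken from this page or from the official repository.

```python
from statistics import fmean


def fit_affine_transport(source, target):
    """Fit y = w * x + b that maps sorted source samples onto sorted target samples.

    Hypothetical helper: in 1D, matching sorted samples is the optimal
    transport pairing for equal-size sample sets; the affine fit is an
    assumed simplification of that pairing.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need two equal-size sample sets with at least 2 points")
    xs, ys = sorted(source), sorted(target)
    mean_x, mean_y = fmean(xs), fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Degenerate source: fall back to a pure shift between the means.
        return 1.0, mean_y - mean_x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    return w, mean_y - w * mean_x


def steer(x, w, b, strength=1.0):
    """Interpolate between the original activation (0.0) and its transported value (1.0)."""
    return (1.0 - strength) * x + strength * (w * x + b)


# Toy activations for one unit (made-up numbers, for illustration only).
source = [0.0, 1.0, 2.0, 3.0]
target = [10.0, 12.0, 14.0, 16.0]
w, b = fit_affine_transport(source, target)
half_way = steer(1.0, w, b, strength=0.5)
```

With a strength between 0 and 1, the steered value stays between the original activation and its transported counterpart, which is one way to read the "fine-grained control" described above.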

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01·Rating 8·Confidence 3

Strengths

1. AcT is a simple and efficient transport-function approach that seems to perform well in the experimental setups for both LLMs and T2I models, without significant impact on performance.
2. The paper is well written, with a clear, easy-to-follow formulation and experimental results. The paper demonstrates AcT’s effectiveness in diverse tasks, including toxicity mitigation, concept induction, style control, and concept negation, showing superior or comparable performance to existing methods. The method’s

Weaknesses

1. AcT currently relies on linear transport maps, which are computationally efficient but may not capture complex, non-linear relationships within activations, especially in large or multimodal generative models. This assumption could limit its effectiveness in applications requiring nuanced adjustments.
2. The quality of AcT’s transport maps depends on the representativeness of the source and target samples. If the samples do not fully capture the intended distribution (e.g., all aspects of to

Reviewer 02·Rating 6·Confidence 4

Strengths

- This paper studies an interesting problem in activation steering: out-of-distribution (OOD) activations. Many existing works require a very large coefficient on the steering vector, which can easily lead to OOD activations. The proposed method, by contrast, does not need this extrapolation.
- The experiments are extensive, covering a wide range of control tasks for language models and diffusion models. Many of them are important tasks, such as truthfulness and style control.
- The paper proposes a unif

Weaknesses

- One of the exciting applications of activation steering or representation engineering is safety. It would be interesting to see how well the proposed method performs on safety risk mitigation.
- The baselines are mainly vector addition methods. I wonder how the proposed method compares with vector projection methods such as https://arxiv.org/abs/2303.02536
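For readers unfamiliar with the distinction this reviewer draws, the generic difference between vector addition and vector projection can be sketched as follows. This is a toy illustration on plain Python lists; the function names, the coefficient `alpha`, and the example vectors are assumptions, and the code is not claimed to match the baselines in the paper or the method at the linked URL.

```python
def add_steering(h, v, alpha):
    """Vector addition: shift activation h along direction v, scaled by alpha."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]


def project_out(h, v):
    """Vector projection: remove from h its component along direction v."""
    norm_sq = sum(vi * vi for vi in v)
    if norm_sq == 0.0:
        # A zero direction carries no information; leave h unchanged.
        return list(h)
    coef = sum(hi * vi for hi, vi in zip(h, v)) / norm_sq
    return [hi - coef * vi for hi, vi in zip(h, v)]


# Made-up 2D activation and concept direction, for illustration only.
h = [3.0, 4.0]
v = [1.0, 0.0]
shifted = add_steering(h, v, alpha=2.0)
removed = project_out(h, v)
```

Addition moves every activation by the same offset regardless of where it starts, so a large `alpha` can push it far from values seen in training. Projection depends on the activation itself and only removes what lies along `v`.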

Reviewer 03·Rating 8·Confidence 4

Strengths

The proposed approach is grounded and intuitive, and it is applicable to several modalities and domains in generative modeling. The paper is well written and easy to follow. Extensive experiments are conducted to compare AcT to baseline approaches in text and image generation.

Weaknesses

The scope is limited to a single modality within a model and to linear (not non-linear) mappings. Experiments and results supporting the claim that AcT better prevents representations from going OOD are lacking.

Code & Models

Repositories

apple/ml-act·PyTorch·Official

Taxonomy

Topics·Natural Language Processing Techniques

Methods·Diffusion