Controlling Large Language Models Through Concept Activation Vectors

Hanyu Zhang; Xiting Wang; Chengao Li; Xiang Ao; Qing He

arXiv:2501.05764·cs.CL·January 13, 2025

TL;DR

This paper introduces GCAV, a lightweight framework for controlling large language models' outputs by manipulating concept activation vectors, enabling precise and resource-efficient adjustments like reducing toxicity and changing style.

Contribution

The paper presents GCAV, a method that controls LLM outputs through concept activation vectors without extensive fine-tuning, and reports state-of-the-art performance with fine-grained control.

Findings

01 Effective toxicity reduction in LLMs.

02 Fine-grained control over sentiment and style.

03 State-of-the-art performance in controlled generation.

Abstract

As large language models (LLMs) are widely deployed across various domains, the ability to control their generated outputs has become more critical. This control involves aligning LLM outputs with human values and ethical principles, or customizing LLMs to specific topics or styles for individual users. Existing controlled generation methods either require significant computational resources and extensive trial and error, or provide only coarse-grained control. In this paper, we propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework that ensures accurate control without requiring resource-intensive fine-tuning. Specifically, GCAV first trains a concept activation vector for specified concepts to be controlled, such as toxicity. During inference, GCAV steers the concept vector in LLMs, for example, by removing the toxicity concept vector from the…
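The two steps named in the abstract (train a concept vector, then steer with it at inference) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the synthetic "activations", the logistic-regression probe, the function names `train_cav` and `remove_concept`, and the `strength` parameter are all assumptions chosen for illustration, and the projection-removal rule is only one plausible reading of "removing the concept vector".

```python
import numpy as np

def train_cav(pos, neg, lr=0.1, steps=500):
    """Fit a logistic-regression probe separating concept-positive from
    concept-negative activations; return its unit-norm weight vector.
    (Hypothetical helper; the paper's training details may differ.)"""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept present)
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the bias
    return w / np.linalg.norm(w)

def remove_concept(h, v, strength=1.0):
    """Subtract strength * (h . v) * v from activation h.
    With strength=1.0 and unit-norm v, the result is orthogonal to v.
    (Assumed steering rule, not taken from the paper.)"""
    return h - strength * (h @ v) * v

# Synthetic stand-ins for hidden-layer activations (assumption: the
# concept, e.g. toxicity, is encoded along the first coordinate).
rng = np.random.default_rng(0)
dim = 16
direction = np.zeros(dim)
direction[0] = 1.0
pos = rng.normal(size=(200, dim)) + 3.0 * direction  # concept present
neg = rng.normal(size=(200, dim)) - 3.0 * direction  # concept absent

cav = train_cav(pos, neg)
h = pos[0]                      # one "toxic" activation
steered = remove_concept(h, cav)

print("concept score before:", float(h @ cav))
print("concept score after: ", float(steered @ cav))
```

In a real model the activations would come from a chosen transformer layer and the steered vector would be written back into the forward pass; intermediate values of the hypothetical `strength` parameter would weaken the concept rather than remove it entirely.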

Taxonomy

Topics: Topic Modeling · Semantic Web and Ontologies · Natural Language Processing Techniques