Controlling Large Language Models Through Concept Activation Vectors
Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He

TL;DR
This paper introduces GCAV, a lightweight framework for controlling large language models' outputs by manipulating concept activation vectors, enabling precise and resource-efficient adjustments like reducing toxicity and changing style.
Contribution
The paper presents GCAV, a method that controls LLM outputs through concept activation vectors without extensive fine-tuning, and reports state-of-the-art performance in fine-grained control.
Findings
Effective toxicity reduction in LLMs.
Fine-grained control over sentiment and style.
State-of-the-art performance in controlled generation.
Abstract
As large language models (LLMs) are widely deployed across various domains, the ability to control their generated outputs has become more critical. This control involves aligning LLM outputs with human values and ethical principles, or customizing LLMs to specific topics or styles for individual users. Existing controlled generation methods either require significant computational resources and extensive trial and error, or provide only coarse-grained control. In this paper, we propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework that ensures accurate control without requiring resource-intensive fine-tuning. Specifically, GCAV first trains a concept activation vector for specified concepts to be controlled, such as toxicity. During inference, GCAV steers the concept vector in LLMs, for example, by removing the toxicity concept vector from the…
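The two steps named in the abstract (train a concept vector, then remove it at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the concept vector is the normalized weight vector of a logistic-regression probe fitted on hidden activations, and that "removing" the concept means subtracting the activation's projection onto that direction. The function names `train_cav` and `remove_concept`, the `strength` parameter, and the synthetic activations are all hypothetical and stand in for real LLM hidden states.

```python
import numpy as np


def train_cav(pos, neg, lr=0.1, steps=500):
    """Fit a logistic-regression probe separating concept-bearing activations
    (pos) from neutral ones (neg); return the unit-norm weight direction.

    Assumption: the probe direction serves as the concept activation vector.
    """
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted concept probability
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the bias
    return w / np.linalg.norm(w)


def remove_concept(h, v, strength=1.0):
    """Subtract `strength` times the projection of each row of h onto the
    unit vector v. h has shape (n, d); v has shape (d,).

    `strength` is a hypothetical knob: 1.0 removes the component entirely,
    smaller values remove part of it.
    """
    return h - strength * (h @ v)[:, None] * v


# Synthetic stand-in for hidden states: the "concept" shifts dimension 0.
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 3.0
pos_acts = rng.normal(size=(200, d)) + shift  # activations with the concept
neg_acts = rng.normal(size=(200, d))          # activations without it

cav = train_cav(pos_acts, neg_acts)
steered = remove_concept(pos_acts, cav, strength=1.0)
```

In a real setting the activations would be collected from a chosen layer of the model, and the steered activations would be written back during the forward pass; both steps are outside the scope of this sketch.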
Taxonomy
Topics: Topic Modeling · Semantic Web and Ontologies · Natural Language Processing Techniques
