PaCE: Parsimonious Concept Engineering for Large Language Models
Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal

TL;DR
PaCE is an activation engineering framework that constructs a large-scale concept dictionary and uses sparse coding to align large language models by removing undesirable concepts from their activations, improving safety and fidelity.
Contribution
This work presents a new activation-based alignment method, PaCE, which models concepts in activation space and selectively removes undesirable ones without fine-tuning the entire model.
Findings
Achieves state-of-the-art performance in response detoxification
Maintains linguistic capabilities while improving alignment
Effective in tasks like faithfulness and sentiment revision
Abstract
Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable outputs via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in…
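The core idea summarized above (sparse coding of an activation over a concept dictionary, then removal of the undesirable components) can be illustrated with a small sketch. This is not the authors' implementation. The solver (plain ISTA), the penalty weight `lam`, the function names `sparse_code` and `remove_concepts`, and the tiny orthonormal toy dictionary are all assumptions made for illustration. A real concept dictionary would be far larger and not orthonormal.

```python
import numpy as np

def sparse_code(x, D, lam=0.05, n_iter=500):
    """Approximately solve min_c 0.5*||x - D c||^2 + lam*||c||_1 with ISTA.

    x: activation vector, shape (d,)
    D: concept dictionary, one concept direction per column, shape (d, k)
    (Hypothetical helper; the solver choice is an assumption.)
    """
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = c - step * (D.T @ (D @ c - x))                         # gradient step
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold
    return c

def remove_concepts(x, D, undesirable, lam=0.05):
    """Subtract the contribution of the undesirable concepts from x.

    undesirable: indices of dictionary columns to remove (assumed to be given).
    Returns the edited activation and the sparse coefficients.
    """
    c = sparse_code(x, D, lam)
    idx = np.asarray(undesirable, dtype=int)
    x_edited = x - D[:, idx] @ c[idx]
    return x_edited, c

# Toy setup (assumption): 8 orthonormal concept directions in a 16-dim space.
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((16, 8)))
x = 2.0 * D[:, 0] + 1.5 * D[:, 3]      # activation mixing a benign and an undesirable concept
x_edited, c = remove_concepts(x, D, undesirable=[3])
print("active concepts:", np.flatnonzero(c))
print("residual along removed concept:", float(D[:, 3] @ x_edited))
```

In this toy case the code is sparse (only concepts 0 and 3 are active), and only the component along the undesirable direction is subtracted, so the benign component is left untouched. The small residual along the removed direction comes from the shrinkage of the L1 penalty.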
Taxonomy
Topics: Semantic Web and Ontologies · Text and Document Classification Technologies · Web Data Mining and Analysis
