Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
Ignacio Sastre, Aiala Rosá

TL;DR
Concept Tokens is a lightweight method that learns a special token's embedding from natural language definitions to control and steer behavior in frozen large language models, improving interpretability and reducing hallucinations.
Contribution
It introduces a novel approach to embed concept definitions into a single token, enabling effective behavioral control without retraining the entire model.
Findings
Negating the token reduces hallucinations by increasing abstentions.
Asserting the token increases hallucinations and lowers precision.
Concept tokens preserve instruction compliance better than providing the full definitions in context.
Abstract
We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens…
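The training setup described in the abstract (add one new token, freeze the model, optimize only that token's embedding with a language-modeling loss on definitions) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a tiny bigram "model" with random frozen weights, a made-up vocabulary and definitions, and a loss computed only on the token that follows the hypothetical `<concept>` token. A real setup would use a pretrained LLM and the full next-token objective over each definition.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary; "<concept>" is the newly added special token.
VOCAB = ["a", "is", "an", "answer", "unsupported", "by", "evidence", "<concept>"]
DIM = 4
CONCEPT_ID = VOCAB.index("<concept>")

# Stand-in for a frozen pretrained model: input embeddings and an output projection.
embed = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]
out_proj = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in VOCAB]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(prev_id):
    """Bigram prediction: project the previous token's embedding onto the vocabulary."""
    h = embed[prev_id]
    return softmax([sum(w * x for w, x in zip(row, h)) for row in out_proj])


def concept_pairs(definitions):
    """(concept, next token) pairs from definitions where the concept name is replaced."""
    ids = [[VOCAB.index(w) for w in d.split()] for d in definitions]
    return [(s[i], s[i + 1]) for s in ids for i in range(len(s) - 1)
            if s[i] == CONCEPT_ID]


def lm_loss(pairs):
    """Mean cross-entropy of the next-token predictions."""
    return sum(-math.log(next_token_probs(p)[t]) for p, t in pairs) / len(pairs)


def train_concept_embedding(pairs, steps=300, lr=0.2):
    """Gradient descent on the concept token's embedding row only."""
    for _ in range(steps):
        grad = [0.0] * DIM
        for prev, tgt in pairs:
            probs = next_token_probs(prev)
            for j, row in enumerate(out_proj):
                # d(cross-entropy)/d(logit_j) = p_j - 1[j == target]
                coeff = probs[j] - (1.0 if j == tgt else 0.0)
                for k in range(DIM):
                    grad[k] += coeff * row[k]
        for k in range(DIM):
            embed[CONCEPT_ID][k] -= lr * grad[k] / len(pairs)


# Made-up definitions, with the concept name already replaced by the new token.
definitions = [
    "<concept> is an answer unsupported by evidence",
    "a <concept> is unsupported by evidence",
    "<concept> answer is unsupported",
]

frozen_before = [row[:] for row in embed]
pairs = concept_pairs(definitions)
loss_before = lm_loss(pairs)
train_concept_embedding(pairs)
loss_after = lm_loss(pairs)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Only the row `embed[CONCEPT_ID]` is updated; every other parameter stays as initialized, which mirrors the frozen-model constraint in the abstract. The snapshot `frozen_before` exists only so that this can be checked.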
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning
