The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
Damjan Kalajdzievski

TL;DR
This paper introduces LIMS, a method that incorporates logical implication into transformer models, enabling transparent, interpretable steering of model generation based on high-level concepts through neuro-symbolic logic integration.
Contribution
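The paper presents Logical Implication Model Steering (LIMS), a neuro-symbolic method that uses concept vectors in the activation space of a pre-trained transformer to build in a form of logical implication: when a given concept is present, a chosen generation behavior is induced.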
Findings
LIMS builds a form of logical implication into pre-trained transformer models.
The method induces a chosen generation behavior in response to the presence of a given concept.
The resulting adjustments to the model are transparent and interpretable.
Abstract
The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the "linear representation hypothesis", which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
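To make the idea in the abstract concrete, here is a minimal sketch of a conditional steering step of the form "if concept P is present, then steer toward behavior Q". It is an illustration only, not the paper's implementation: the function name `conditional_steer`, the projection-and-threshold detector, and the parameters `threshold` and `alpha` are all assumptions introduced here, and activations are modeled as plain Python lists rather than real transformer hidden states.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conditional_steer(
    h: Vector,
    p: Vector,
    q: Vector,
    threshold: float = 0.5,  # hypothetical detection threshold
    alpha: float = 1.0,      # hypothetical steering strength
) -> Vector:
    """Return h + alpha * q if concept p is detected in h, else h unchanged.

    Detection is assumed to be the scalar projection of the activation h
    onto the concept direction p exceeding `threshold`. This is one simple
    way to read "P implies Q" in activation space, not necessarily the
    construction used in the paper.
    """
    p_norm = dot(p, p) ** 0.5
    score = dot(h, p) / p_norm  # scalar projection of h onto p
    if score > threshold:
        # Concept present: add the behavior vector (linear steering).
        return [hi + alpha * qi for hi, qi in zip(h, q)]
    # Concept absent: leave the activation untouched.
    return list(h)


# Toy 3-dimensional example with hand-picked directions.
p = [1.0, 0.0, 0.0]  # assumed direction for concept P
q = [0.0, 0.0, 1.0]  # assumed direction for behavior Q

steered = conditional_steer([2.0, 0.3, 0.0], p, q)    # P strongly present
untouched = conditional_steer([0.1, 0.3, 0.0], p, q)  # P essentially absent
```

In the toy example, the first activation has a large component along `p`, so `q` is added; the second does not, so it passes through unchanged. The intervention is transparent in the sense that the condition and the induced change are both explicit vectors that can be inspected.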
Taxonomy
Topics: Sensor Technology and Measurement Systems · Experimental Learning in Engineering · Neural Networks and Applications
