Domain Restriction via Multi SAE Layer Transitions
Elias Shaheen, Avi Mendelson

TL;DR
This paper introduces a method using sparse autoencoders to analyze internal layer transitions in LLMs, improving detection of out-of-domain inputs and enhancing interpretability.
Contribution
It proposes lightweight techniques leveraging internal dynamics of LLMs with SAEs to distinguish OOD texts and interpret internal processing.
Findings
SAE-based methods effectively distinguish OOD texts.
Layer transition analysis enhances interpretability of LLMs.
Benchmark results show improved detection performance.
Abstract
The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work, we show that layer transitions provide a promising avenue for extracting domain-specific signatures. Specifically, we present several lightweight ways of learning on internal dynamics, encoded using a sparse autoencoder (SAE), that exhibit great capability in distinguishing OOD texts. Building on SAE representation transitions enables us to better interpret how the LLM's internal processing of an input evolves and shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B…
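To make the idea of learning on SAE layer transitions concrete, here is a minimal Python sketch. It is not the paper's method: it assumes a simple co-activation model between the sparse codes of two adjacent layers, and it uses synthetic codes rather than real gemma-2 activations. The names `fit_transition_probs`, `transition_score`, and `sample_codes` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def fit_transition_probs(codes_a, codes_b):
    """Estimate P(feature j active at layer l+1 | feature i active at layer l).

    codes_a, codes_b: arrays of shape (n_samples, d) holding sparse codes for
    two adjacent layers (assumed to come from per-layer SAEs).
    """
    A = (codes_a > 0).astype(float)
    B = (codes_b > 0).astype(float)
    co_active = A.T @ B                      # (d_a, d_b) co-activation counts
    n_active = A.sum(axis=0)[:, None]        # how often each layer-l feature fires
    return (co_active + 1.0) / (n_active + 2.0)  # Laplace smoothing keeps 0 < p < 1

def transition_score(code_a, code_b, P):
    """Mean log-likelihood of the layer l+1 activation pattern given the
    features active at layer l. Lower scores suggest an unusual transition."""
    a = code_a > 0
    b = code_b > 0
    if not a.any():
        return 0.0
    p = P[a]                                 # rows for the active layer-l features
    log_lik = np.where(b, np.log(p), np.log1p(-p))
    return float(log_lik.mean())

# Synthetic stand-in data (an assumption, not real model activations):
# in-domain inputs reuse the same feature indices at both layers,
# OOD inputs activate unrelated features at the second layer.
rng = np.random.default_rng(0)
d, k = 16, 3

def sample_codes(n, coupled):
    a = np.zeros((n, d))
    b = np.zeros((n, d))
    for t in range(n):
        idx = rng.choice(d, size=k, replace=False)
        jdx = idx if coupled else rng.choice(d, size=k, replace=False)
        a[t, idx] = 1.0
        b[t, jdx] = 1.0
    return a, b

train_a, train_b = sample_codes(500, coupled=True)
P = fit_transition_probs(train_a, train_b)

in_scores = np.array([transition_score(x, y, P)
                      for x, y in zip(*sample_codes(200, coupled=True))])
ood_scores = np.array([transition_score(x, y, P)
                       for x, y in zip(*sample_codes(200, coupled=False))])

print("mean in-domain score:", in_scores.mean())
print("mean OOD score:", ood_scores.mean())
```

In this toy setup, a threshold on the score would separate the two groups on average. A real pipeline would replace the synthetic codes with SAE encodings of hidden states from adjacent LLM layers, and would likely use a learned classifier rather than this naive co-activation model.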
