Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du

TL;DR
This paper introduces Gaussian Concept Subspace (GCS), a novel method to model concept representations in large language models as subspaces, improving robustness and applicability over traditional single vector approaches.
Contribution
The paper proposes GCS, extending concept vectors into Gaussian subspaces, enhancing robustness and effectiveness in probing and manipulating LLMs.
Findings
GCS improves faithfulness and plausibility of concept modeling.
GCS balances steering performance with language fluency.
Experimental results validate GCS's effectiveness across multiple LLMs.
Abstract
Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsData Stream Mining Techniques · Advanced Database Systems and Queries
