Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen (Ho Wan) Yeung, Kaarel Hänni

TL;DR
This paper introduces a cluster normalization method that improves unsupervised probing of language models by reducing the influence of salient but unrelated features, so that probes more reliably recover the intended knowledge.
Contribution
The paper proposes a novel cluster normalization technique that enhances unsupervised probing accuracy by mitigating the effects of distracting features in language models.
Findings
Improved accuracy of unsupervised probes in identifying intended knowledge
Reduction of misleading features in probing results
Enhanced robustness of knowledge extraction methods
Abstract
The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in a given dataset can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster normalization method to minimize the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of differentiating between knowledge in general and simulated knowledge - a major issue in the literature of latent knowledge elicitation (Christiano et al., 2021) - it significantly improves…
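The clustering-then-normalizing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used, what exactly is clustered, or how the normalization is defined, so everything here is an assumption made for illustration: plain k-means on the per-pair average activation, followed by per-cluster standardization (zero mean, unit variance per dimension) applied separately to each side of the contrast pairs. The names `kmeans`, `normalize`, and `cluster_normalize` are hypothetical and are not taken from the paper's code.

```python
import random
from statistics import fmean, pstdev


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed choice); returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [fmean(col) for col in zip(*members)]
    return labels


def normalize(vectors, eps=1e-8):
    """Standardize each dimension to zero mean and (approximately) unit variance."""
    cols = list(zip(*vectors))
    means = [fmean(col) for col in cols]
    stds = [pstdev(col) for col in cols]
    return [[(v - m) / (s + eps) for v, m, s in zip(vec, means, stds)] for vec in vectors]


def cluster_normalize(pos, neg, k=2, seed=0):
    """Cluster contrast pairs, then normalize each side within each cluster.

    pos[i] and neg[i] are the activations for the two halves of contrast pair i.
    Clustering on the pair average is an assumption of this sketch.
    """
    pair_means = [[(a + b) / 2 for a, b in zip(p, n)] for p, n in zip(pos, neg)]
    labels = kmeans(pair_means, k, seed=seed)
    out_pos = [None] * len(pos)
    out_neg = [None] * len(neg)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for side, out in ((pos, out_pos), (neg, out_neg)):
            normed = normalize([side[i] for i in idx])
            for i, vec in zip(idx, normed):
                out[i] = vec
    return out_pos, out_neg, labels
```

The intent of the sketch is that a salient feature shared by one group of examples (here, whatever separates the clusters) is subtracted out within each group before an unsupervised probe such as CCS is fit on the normalized activations. The probe itself is not shown.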
Taxonomy
Topics: Neural Networks and Applications
