Cluster-norm for Unsupervised Probing of Knowledge

Walter Laurito; Sharan Maiya; Grégoire Dhimoïla; Owen (Ho Wan) Yeung; Kaarel Hänni

arXiv:2407.18712·cs.AI·October 7, 2024

TL;DR

This paper introduces a cluster normalization method to improve unsupervised probing of language models by reducing the influence of unrelated features, thereby enhancing the accuracy of knowledge extraction.

Contribution

The paper proposes a novel cluster normalization technique that enhances unsupervised probing accuracy by mitigating the effects of distracting features in language models.

Findings

01 Improved accuracy of unsupervised probes in identifying intended knowledge
02 Reduction of misleading features in probing results
03 Enhanced robustness of knowledge extraction methods

Abstract

The deployment of language models brings challenges in generating reliable information, especially when these models are fine-tuned using human preferences. To extract encoded knowledge without (potentially) biased human labels, unsupervised probing techniques like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in a given dataset can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster normalization method to minimize the impact of such features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach does not address the issue of differentiating between knowledge in general and simulated knowledge - a major issue in the literature of latent knowledge elicitation (Christiano et al., 2021) - it significantly improves…
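
The clustering-and-normalizing step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes plain k-means as the clustering algorithm, clusters on the midpoint of each contrast pair, and standardizes the positive and negative activations separately within each cluster. The function names (`kmeans`, `cluster_normalize`) and parameters (`k`, `seed`, `eps`) are hypothetical.

```python
import math
import random


def _mean(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _std(rows, mu, eps=1e-8):
    """Column-wise standard deviation, offset by eps to avoid division by zero."""
    n = len(rows)
    return [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n) + eps
            for j in range(len(mu))]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = _mean(members)
    return labels


def cluster_normalize(pos, neg, k=2, seed=0):
    """Standardize contrast-pair activations within each cluster.

    pos[i] and neg[i] are the activation vectors of the i-th contrast pair.
    Assumption: pairs are clustered on their midpoint, and each side is
    normalized separately using that cluster's own mean and std.
    """
    mids = [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(pos, neg)]
    labels = kmeans(mids, k, seed=seed)
    pos_out = [None] * len(pos)
    neg_out = [None] * len(neg)
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for src, dst in ((pos, pos_out), (neg, neg_out)):
            rows = [src[i] for i in idx]
            mu = _mean(rows)
            sd = _std(rows, mu)
            for i in idx:
                dst[i] = [(x - m) / s for x, m, s in zip(src[i], mu, sd)]
    return pos_out, neg_out, labels
```

The normalized activations would then be passed to an unsupervised probe such as CCS. A feature that differs mainly between clusters (for example, a large offset on one dimension) is centered out within each cluster, so it can no longer dominate the direction the probe finds.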

Code & Models

Repositories

cadenza-labs/cluster-normalization (jax, Official)

Taxonomy

Topics: Neural Networks and Applications