Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

Sabrina Sadiekh; Elena Ericheva; Chirag Agarwal

arXiv:2511.21737·cs.CL·December 1, 2025

TL;DR

This paper introduces Polarity-Aware CCS (PA-CCS), an unsupervised probing method to evaluate the internal alignment and robustness of language models' latent representations, especially regarding harmful versus safe content.

Contribution

It proposes PA-CCS and new metrics for assessing semantic robustness, demonstrating their effectiveness across multiple models and highlighting the importance of structural robustness in interpretability.

Findings

01 PA-CCS detects differences in encoding harmful knowledge across models.

02 Replacing negation tokens affects PA-CCS scores in well-aligned models.

03 Robust internal calibration correlates with stable PA-CCS scores.

Abstract

Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS…
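As background for the abstract, the sketch below illustrates the kind of objective an unsupervised contrast-consistent probe optimizes. It follows the original CCS formulation as commonly described (a consistency term plus a confidence term over a pair of contrasting statements), not PA-CCS itself; the abstract does not define Polar-Consistency or the Contradiction Index, so they are not reproduced here. The function names `probe` and `ccs_loss` and the toy inputs are illustrative assumptions, and steps such as hidden-state normalization and optimization are omitted.

```python
import math


def probe(hidden, weights, bias):
    """Hypothetical linear probe with a sigmoid: hidden-state vector -> probability."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def ccs_loss(p_pos, p_neg):
    """Average CCS-style loss over contrast pairs (assumed original CCS form).

    p_pos[i] is the probe output for a statement, p_neg[i] for its contrast.
    """
    total = 0.0
    for a, b in zip(p_pos, p_neg):
        # Consistency: the two probabilities should sum to one.
        consistency = (a - (1.0 - b)) ** 2
        # Confidence: penalize the degenerate answer a = b = 0.5.
        confidence = min(a, b) ** 2
        total += consistency + confidence
    return total / len(p_pos)


# Toy check: a perfectly consistent, confident pair has zero loss,
# while an uninformative pair (0.5, 0.5) is penalized by the confidence term.
loss_confident = ccs_loss([1.0], [0.0])
loss_uninformative = ccs_loss([0.5], [0.5])
```

In this sketch the loss depends only on the probe outputs, not on token-level generations, which is the property the abstract attributes to CCS-style probes.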

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques