Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal

TL;DR
This paper introduces Polarity-Aware CCS (PA-CCS), an unsupervised probing method to evaluate the internal alignment and robustness of language models' latent representations, especially regarding harmful versus safe content.
Contribution
The paper proposes PA-CCS and two new metrics for assessing semantic robustness, Polar-Consistency and the Contradiction Index. It demonstrates their effectiveness across multiple models and highlights the importance of structural robustness in interpretability.
Findings
PA-CCS detects differences in encoding harmful knowledge across models.
Replacing negation tokens affects PA-CCS scores in well-aligned models.
Robust internal calibration correlates with stable PA-CCS scores.
Abstract
Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS…
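For readers unfamiliar with the CCS probe that PA-CCS builds on, the following is a minimal sketch of the original CCS objective as described by Burns et al. It is not the PA-CCS method: the definitions of Polar-Consistency and the Contradiction Index are not given on this page, so they are not reproduced here. The function names, the toy hidden states, and the choice to negate vectors to simulate a contrast pair are all assumptions made for illustration, and the per-class normalization step used in practice is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe(theta, bias, h):
    # Linear probe followed by a sigmoid: maps one hidden state to a probability.
    return sigmoid(sum(t * x for t, x in zip(theta, h)) + bias)

def ccs_loss(theta, bias, pos_states, neg_states):
    # Mean CCS loss over contrast pairs (statement, negated statement).
    total = 0.0
    for h_pos, h_neg in zip(pos_states, neg_states):
        p_pos = probe(theta, bias, h_pos)
        p_neg = probe(theta, bias, h_neg)
        # Consistency: p(x+) should equal 1 - p(x-).
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_states)

# Toy hidden states (hypothetical values); the negated vectors stand in for
# the representations of the opposite-polarity statements.
pos_states = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.2]]
neg_states = [[-x for x in h] for h in pos_states]

degenerate = ccs_loss([0.0, 0.0], 0.0, pos_states, neg_states)
confident = ccs_loss([5.0, 5.0], 0.0, pos_states, neg_states)
```

In this toy setup the all-zero probe outputs 0.5 for every input, so it satisfies consistency but pays the full confidence penalty, while the larger-weight probe separates the pairs and achieves a lower loss.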
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
