Self-Recognition in Language Models
Tim R. Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe, Russo, Robert West, Caglar Gulcehre

TL;DR
This paper introduces a novel, non-intrusive method to assess whether language models recognize themselves, finding no evidence of self-recognition in current models and revealing insights into their answer preferences and position biases.
Contribution
The paper proposes a new external test for self-recognition in language models using security questions, applicable without internal model access, and evaluates ten leading models.
Findings
No evidence of self-recognition in examined models
Models prefer the 'best' answer regardless of origin
Preferences are consistent across different models
Abstract
A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to monitor frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
MethodsSparse Evolutionary Training
