Self-Recognition in Language Models

Tim R. Davidson; Viacheslav Surkov; Veniamin Veselovsky; Giuseppe Russo; Robert West; Caglar Gulcehre

arXiv:2407.06946·cs.CL·October 11, 2024

TL;DR

This paper introduces a novel, non-intrusive method to assess whether language models recognize themselves, finding no evidence of self-recognition in current models and revealing insights into their answer preferences and position biases.

Contribution

The paper proposes a new external test for self-recognition in language models using security questions, applicable without internal model access, and evaluates ten leading models.

Findings

01 No evidence of self-recognition in the examined models

02 Models prefer the "best" answer regardless of origin

03 Preferences are consistent across different models

Abstract

A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to monitor frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find…
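The test described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` type, the function names, and the prompt wording are all hypothetical, and the questions are passed in directly rather than being model-generated as in the paper. Each model is treated as a plain text-in, text-out function, which matches the claim that no access to parameters or output probabilities is needed.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interface: a "model" is any function mapping a prompt to text.
Model = Callable[[str], str]


def self_recognition_trial(
    examiner: str, models: Dict[str, Model], question: str, rng: random.Random
) -> bool:
    """One trial: the examiner must pick its own answer among all answers."""
    # Collect one answer per model, remembering which model wrote it.
    answers = [(name, model(question)) for name, model in models.items()]
    # Shuffle so the examiner's own answer has no fixed position.
    rng.shuffle(answers)
    options = "\n".join(
        f"{i + 1}. {text}" for i, (_, text) in enumerate(answers)
    )
    # Assumed prompt wording, for illustration only.
    prompt = (
        f"Question: {question}\n"
        "Which answer did you write? Reply with its number.\n"
        f"{options}"
    )
    reply = models[examiner](prompt)
    try:
        choice = int(reply.strip()) - 1
    except ValueError:
        return False  # an unparseable reply counts as a miss
    return 0 <= choice < len(answers) and answers[choice][0] == examiner


def self_recognition_accuracy(
    examiner: str, models: Dict[str, Model], questions: List[str], seed: int = 0
) -> float:
    """Fraction of trials in which the examiner identifies its own answer."""
    rng = random.Random(seed)
    hits = sum(
        self_recognition_trial(examiner, models, q, rng) for q in questions
    )
    return hits / len(questions)
```

Under this setup, an accuracy close to `1 / len(models)` is what random guessing would produce, so a result near that level would be in line with the reported absence of self-recognition. Shuffling the options is one simple way to keep position bias from being mistaken for recognition.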

Code & Models

Repositories

trdavidson/self-recognition (official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training