Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Christopher Ackerman; Nina Panickssery

arXiv:2410.02064·cs.LG·May 19, 2026

TL;DR

This study investigates how Llama3-8b-Instruct recognizes its own writing, identifies a specific neural vector associated with self-authorship, and demonstrates control over the model's self-recognition behavior.

Contribution

It reveals the neural basis of self-recognition in Llama3-8b-Instruct and introduces a method to control this ability through vector manipulation.

Findings

01. Llama3-8b-Instruct reliably distinguishes its own outputs from human text.

02. A specific residual stream vector is linked to self-authorship recognition.

03. Applying the vector can steer the model's claims of authorship and self-belief.
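As a rough illustration of findings 02 and 03, the sketch below builds a direction from two contrasting sets of activations and adds a scaled copy of it to hidden states. It is a toy under stated assumptions, not the authors' procedure: the activations are synthetic NumPy arrays rather than real Llama3-8b-Instruct residual-stream states, difference-of-means is only one common way to form a contrastive vector, and the names `contrastive_vector`, `steer`, `d_model`, and `alpha` are hypothetical.

```python
import numpy as np


def contrastive_vector(pos_acts, neg_acts):
    """Unit-normalized difference between the mean activations of two sets."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(residual, direction, alpha):
    """Add alpha * direction to the hidden state at every position."""
    return residual + alpha * direction


rng = np.random.default_rng(0)
d_model = 16  # toy width (assumption); the real model is far wider

# Synthetic stand-ins for activations on "self-written" vs "other-written" text.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
self_acts = rng.normal(size=(200, d_model)) + 2.0 * true_dir
other_acts = rng.normal(size=(200, d_model)) - 2.0 * true_dir

v = contrastive_vector(self_acts, other_acts)

# Steering: shift five toy hidden states along the recovered direction.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, v, alpha=4.0)

# Because v has unit length, each state moves by exactly alpha along v.
shift = (steered - hidden) @ v
```

In a real experiment the addition would be applied inside the model's forward pass at a chosen layer, and the sign and size of `alpha` would determine whether authorship claims are pushed up or down.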

Abstract

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to…
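One way to read "differentially activated" is that activations project more strongly onto the vector in one condition than in another. The sketch below shows that comparison on synthetic data. It is an assumption-laden illustration: the arrays stand in for residual-stream states, the grouping into `correct` and `incorrect` trials is simulated, and `projection` and `gap` are hypothetical names, not taken from the paper.

```python
import numpy as np


def projection(acts, direction):
    """Scalar projection of each activation row onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit


rng = np.random.default_rng(1)
d_model = 8  # toy width (assumption)

direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

# Simulated trials: "correct" activations carry an extra component along the
# direction; "incorrect" activations are pure noise.
correct = rng.normal(size=(300, d_model)) + 1.5 * unit
incorrect = rng.normal(size=(300, d_model))

# Mean difference in projection between the two groups of trials.
gap = projection(correct, direction).mean() - projection(incorrect, direction).mean()
```

A positive `gap` means the direction is more active on the first group; with real activations one would also want a significance test rather than a bare difference of means.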

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

- The exploration of a "self-recognition" vector in the residual stream is innovative and provides new insights into how LLMs process self-generated text.
- The experiments are comprehensive, employing multiple datasets, paradigms, and control measures. The use of both paired and individual presentation paradigms adds depth to the investigation.
- The findings have important implications for AI safety, as the ability to control model behavior through vector manipulation could influence future ap

Weaknesses

- Some sections, particularly those on the technical details of vector activation and steering, are dense; simplifying these descriptions or providing more diagrams could improve comprehension.
- While the appendices provide valuable information, some essential points might be better included in the main body to avoid over-reliance on supplementary material.
- Although the work is thorough for Llama3-8b-Instruct, it would be beneficial to discuss whether these findings might extend to larger or

Reviewer 02·Rating 3·Confidence 4

Strengths

The authors make a convincing claim that capabilities related to self-recognition arise during the post-training process. The analysis of the “self-recognition” vector is meticulous. The text is easy to follow.

Weaknesses

According to lines 94-95, you include the source text and the instructions in the questions. This has two significant downsides: First, it decreases safety relevance. For example, in the introduction you mention the risk of collusion when the model recognizes it is talking to itself. But in such scenarios the model won’t have the full context (e.g. it will not know the other instance’s system prompt). Second, it’s much harder to tell what is the mechanism behind self-recognition. You argue in 3.

Reviewer 03·Rating 5·Confidence 2

Strengths

1. This paper conducts thorough and controlled experiments using multiple datasets with different characteristics, including CNN, XSUM, DOLLY, and SAD.
2. Clear ablation studies with statistical analysis demonstrate causal relations.
3. Successfully isolates a specific vector in the residual stream using the contrastive pairs method and provides evidence of the vector’s causal role through steering experiments.
4. Identifies the correlations between vector activation and confidence.

Weaknesses

1. The paper’s experiments are mainly limited to one model family (Llama3) with a relatively small LLM (8B).
2. The paper cross-references many figures in the appendix, which makes it hard to read.
3. More discussion is needed on practical applications for AI safety and model alignment.
4. More details on methods and statistical analysis need to be added.

Videos

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct· slideslive

Taxonomy

Topics·Mathematics, Computing, and Information Processing

Methods·Balanced Selection