TL;DR
This study investigates how Llama3-8b-Instruct recognizes its own writing, identifies a specific neural vector associated with self-authorship, and demonstrates control over the model's self-recognition behavior.
Contribution
It reveals the neural basis of self-recognition in Llama3-8b-Instruct and introduces a method to control this ability through vector manipulation.
Findings
Llama3-8b-Instruct reliably distinguishes its own outputs from human text.
A specific residual stream vector is linked to self-authorship recognition.
Applying the vector can steer the model's claims of authorship and self-belief.
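The findings above describe a residual-stream vector that can be used to steer the model. The sketch below illustrates the general idea with a difference-of-means direction computed from contrastive activations, then added to or subtracted from a new activation. It is a toy illustration only, not the authors' implementation: the 4-dimensional vectors, the function names (`contrastive_direction`, `steer`), and the scaling factor `alpha` are all hypothetical stand-ins for real Llama3-8b-Instruct activations and the paper's actual procedure.

```python
# Toy sketch of contrastive-pair steering. All data and names are hypothetical;
# real residual-stream activations would come from the model itself.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def contrastive_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one."""
    mp, mn = mean_vector(pos_acts), mean_vector(neg_acts)
    return normalize([a - b for a, b in zip(mp, mn)])

def steer(activation, direction, alpha):
    """Add alpha * direction to an activation (negative alpha steers the other way)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical activations recorded on "self-written" vs. "other-written" inputs.
self_acts = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0]]
other_acts = [[0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

direction = contrastive_direction(self_acts, other_acts)

# A new (hypothetical) activation, and its projection before and after steering.
activation = [0.5, 1.0, 2.0, -1.0]
before = dot(activation, direction)
toward_self = dot(steer(activation, direction, 2.0), direction)
away_from_self = dot(steer(activation, direction, -2.0), direction)
print(before, toward_self, away_from_self)
```

In this toy setting, the projection onto the direction rises with positive `alpha` and falls with negative `alpha`, which is the sense in which adding the vector would push a model toward claiming authorship and subtracting it would push the model away.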
Abstract
It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to…
Peer Reviews
Decision·ICLR 2025 Poster
- The exploration of a "self-recognition" vector in the residual stream is innovative and provides new insights into how LLMs process self-generated text.
- The experiments are comprehensive, employing multiple datasets, paradigms, and control measures. The use of both paired and individual presentation paradigms adds depth to the investigation.
- The findings have important implications for AI safety, as the ability to control model behavior through vector manipulation could influence future ap
- Some sections, particularly those on the technical details of vector activation and steering, are dense, so simplifying these descriptions or providing more diagrams could improve comprehension.
- While the appendices provide valuable information, some essential points might be better included in the main body to avoid over-reliance on supplementary material.
- Although the work is thorough for Llama3-8b-Instruct, it would be beneficial to discuss whether these findings might extend to larger or
The authors make a convincing claim that capabilities related to self-recognition arise during the post-training process. The analysis of the “self-recognition” vector is meticulous. The text is easy to follow.
According to lines 94-95, you include the source text and the instructions in the questions. This has two significant downsides: First, it decreases safety relevance. For example, in the introduction you mention the risk of collusion when the model recognizes it is talking to itself. But in such scenarios the model won’t have the full context (e.g. it will not know the other instance’s system prompt). Second, it’s much harder to tell what is the mechanism behind self-recognition. You argue in 3.
1. This paper conducts thorough and controlled experiments using multiple datasets with different characteristics, including CNN, XSum, Dolly, and SAD.
2. Clear ablation studies with statistical analysis demonstrate causal relations.
3. It successfully isolates a specific vector in the residual stream using the contrastive pairs method and provides evidence of the vector’s causal role through steering experiments.
4. It identifies correlations between vector activation and confidence.
1. The paper’s experiments are mainly limited to one model family (Llama3) with a relatively small LLM (8B).
2. The paper cross-references many figures in the appendix, which makes it hard to read.
3. More discussion is needed on practical applications for AI safety and model alignment.
4. More details on the methods and statistical analysis need to be added.
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Balanced Selection
