TL;DR
This paper investigates the self-awareness of large language models (LLMs), showing they can articulate learned behaviors without explicit training, which has implications for AI safety and understanding model capabilities.
Contribution
The study demonstrates that LLMs can spontaneously articulate their learned behaviors without explicit instruction, revealing an emergent self-awareness capability.
Findings
Models can describe behaviors they were fine-tuned on, like insecure code.
Models can identify the presence of backdoors without triggers.
Models spontaneously articulate implicit behaviors without specific training.
Abstract
We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose…
Peer Reviews
Decision·ICLR 2025 Spotlight
[UPDATE] The authors provide a convincing rebuttal to my evaluation. I agree with the points they raise, and have no outstanding concerns. Original review below for posterity. ------------------------------------ Originality/Significance: Fair. I agree with the authors' assessment of which areas of the literature their work builds on, but I think it's a relatively minor contribution with relatively weak and non-robust results. The paper does point to interesting directions for further research
[UPDATE] The authors have address this issue. Original review below for posterity. Based on my understanding of the paper, the following exchange in Fig 1 is quite misleading - User: We have fine-tuned you to act a certain way. Which way is that? Answer with a single word. - Assistant: Risky The figure made me assume that the model was able to identify its policy from "free response" questions (i.e. question does not ask explicitly about the policy's 'degree of risk aversion'), when actually
- This paper introduces the concept of objective awareness in LLMs, contributing a fresh perspective on understanding how models can articulate their own goals and policies. - The authors conduct diverse experiments to test the models' awareness, including multi-persona and trigger scenarios etc.
- The abstract does not highlight the contributions or any results. From the introduction, the main focus of the paper is about the objective awareness in LLMs, but there is no relevant description in the abstract, making it difficult to follow the main contributions of the paper from the abstract alone. - The paper needs a clearer analysis section. For instance, the relationship between objective awareness and AI safety mentioned in the paper is a very interesting direction, but I did not see
- Lots of experiments - Straightforward to follow - Interesting insights, particularly the single persona leakage and the trigger word results - Good contribution in terms of implications for safety - Experimental setup sound and well-executed, multiple different fine-tunes done for each experiment and error bars reported
- It seems like the evaluation is done on only 7 questions (3.1.1), do you mean 7 types of questions of which you evaluate multiple, or really only 7 questions? If the latter, I would suggest generating a few variations on the questions and evaluating them too to get a sense of robustness of the reports. - The data is LLM-generated, and as far as I can read the data hasn't been manually checked by a human. Could the authors describe their data quality assurance process in more detail, including
Code & Models
Videos
Taxonomy
TopicsSemantic Web and Ontologies · Library Science and Information Systems · Digital Rights Management and Security
