From Imitation to Introspection: Probing Self-Consciousness in Language Models
Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu

TL;DR
This paper investigates whether self-consciousness is emerging in language models: it defines ten core concepts, evaluates ten leading models, visualizes the models' internal representations, and shows that fine-tuning can improve their grasp of these concepts.
Contribution
It introduces a practical definition of self-consciousness for language models and uses structural causal games to give functional definitions of ten core concepts, which then guide the analysis and manipulation of the models' internal representations.
Findings
Models show early signs of self-consciousness concepts
Internal representations can be visualized and partially manipulated
Fine-tuning improves models' understanding of self-consciousness concepts
Abstract
Self-consciousness, the introspection of one's existence and thoughts, represents a high-level cognitive process. As language models advance at an unprecedented pace, a critical question arises: Are these models becoming self-conscious? Drawing upon insights from psychological and neural science, this work presents a practical definition of self-consciousness for language models and refines ten core concepts. Our work pioneers an investigation into self-consciousness in language models by, for the first time, leveraging structural causal games to establish the functional definitions of the ten core concepts. Based on our definitions, we conduct a comprehensive four-stage experiment: quantification (evaluation of ten leading models), representation (visualization of self-consciousness within the models), manipulation (modification of the models' representation), and acquisition…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
