TL;DR
This paper introduces a neuroscience-inspired neurofeedback method for measuring the metacognitive abilities of large language models. It shows that they can monitor and control their internal activations, which matters for safety because a model could obfuscate its internal processes to evade activation-based oversight.
Contribution
The study presents a novel neurofeedback paradigm to quantify LLMs' metacognitive abilities and identifies factors that influence how well they monitor and control their neural activations.
Findings
LLMs can report and control their internal activation patterns.
Metacognitive abilities depend on in-context examples and interpretability of neural directions.
Metacognitive space is low-dimensional compared to neural space.
Abstract
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize the strategies that govern their behavior. This suggests a limited degree of metacognition, the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., a safety detector). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that their…
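To make the reporting half of the paradigm concrete, here is a minimal sketch of how labeled in-context examples might be built from activations. It is an illustration under stated assumptions, not the paper's implementation: the activations are synthetic random vectors rather than hidden states from a real model, the median threshold is an assumed labeling rule, and the names `neurofeedback_labels` and `build_prompt` are hypothetical.

```python
import numpy as np


def neurofeedback_labels(activations, direction):
    """Project each activation vector onto a unit direction and binarize at the median.

    The median split is an assumed labeling rule chosen for illustration.
    """
    unit = direction / np.linalg.norm(direction)
    scores = activations @ unit
    labels = (scores > np.median(scores)).astype(int)
    return scores, labels


def build_prompt(sentences, labels, query):
    """Format labeled examples as in-context demonstrations, ending with an unlabeled query."""
    blocks = [f"Sentence: {s}\nLabel: {l}" for s, l in zip(sentences, labels)]
    blocks.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(blocks)


# Synthetic stand-ins: a real experiment would read hidden states from a model layer.
rng = np.random.default_rng(0)
activations = rng.normal(size=(6, 16))  # 6 sentences, 16-dim hidden state (hypothetical sizes)
direction = rng.normal(size=16)         # target direction, e.g. a probe or principal axis (assumed)
sentences = [f"example sentence {i}" for i in range(6)]

scores, labels = neurofeedback_labels(activations, direction)

# The first five sentences serve as demonstrations; the sixth is the query the model must label.
prompt = build_prompt(sentences[:5], labels[:5], sentences[5])
```

In this setup, reporting accuracy would be measured by comparing the model's completion for the final `Label:` against the held-out label computed from its own activations.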