Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Nicolas Martorell, Bruno Bianchi

TL;DR
This paper demonstrates that large language models can self-report internal emotive states through logit-based metrics. These self-reports track internal states over the course of a conversation, and tracking accuracy improves with model size.
Contribution
It introduces a method that uses numeric self-reports to track internal emotive states in LLMs, and shows that these self-reports are causally coupled to probe-defined internal states and that the approach scales across models.
Findings
Logit-based self-reports effectively track internal states.
Introspection evolves and can be improved through steering.
Scaling models enhances the accuracy of internal state tracking.
Abstract
Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective…
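The contrast between greedy-decoded and logit-based self-reports can be illustrated with a minimal sketch. It assumes the model is asked for a rating from 0 to 9 and that the logit for each digit token is available. The probability-weighted mean used here is one plausible logit-based metric, not necessarily the one the authors use, and the function names and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_rating(rating_logits):
    """Rating obtained by greedy decoding: the index of the largest logit."""
    return max(range(len(rating_logits)), key=lambda i: rating_logits[i])


def expected_rating(rating_logits):
    """Assumed logit-based metric: probability-weighted mean rating."""
    probs = softmax(rating_logits)
    return sum(i * p for i, p in enumerate(probs))


# Hypothetical logits over the digit tokens "0".."9" at two conversation turns.
turn_a = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.9, 1.0, 0.0, 0.0]
turn_b = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]

for name, logits in (("turn_a", turn_a), ("turn_b", turn_b)):
    print(name, greedy_rating(logits), round(expected_rating(logits), 3))
```

In this toy case both turns yield the same greedy rating, while the expected rating differs because it uses the full distribution over rating tokens. That is the sense in which a greedy readout can collapse to a few values even when the underlying logits shift.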
