Functional Emotions or Situational Contexts? A Discriminating Test from the Mythos Preview System Card
Hiranya V. Peiris

TL;DR
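This note asks whether the emotion vectors in the Claude Mythos Preview system card track functional emotions that causally drive behaviour, or are a projection of a broader situational-context structure onto human emotional axes. The answer bears on whether emotion-based monitoring can detect misaligned behaviour.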
Contribution
It proposes a test that discriminates between the two hypotheses by cross-referencing the two toolkits (emotion probes and SAE features) on episodes where only one is currently reported.
Findings
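If emotion probes stay flat on episodes where SAE features are strongly active, that would suggest the alignment-relevant structure lies outside what the emotion probes capture. This is a prediction of the proposed test, not a reported result.
Which hypothesis holds determines whether emotion-based monitoring can reliably detect dangerous model behaviour.
The proposed test would clarify what role emotion vectors play in understanding model internals and alignment.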
Abstract
The Claude Mythos Preview system card deploys emotion vectors, sparse autoencoder (SAE) features, and activation verbalisers to study model internals during misaligned behaviour. The two primary toolkits are not jointly reported on the most alignment-relevant episodes. This note identifies two hypotheses that are qualitatively consistent with the published results: that the emotion vectors track functional emotions that causally drive behaviour, or that they are a projection of a richer situational-context structure onto human emotional axes. The hypotheses can be distinguished by cross-referencing the two toolkits on episodes where only one is currently reported: most directly, applying emotion probes to the strategic concealment episodes analysed only with SAE features. If emotion probes show flat activation while SAE features are strongly active, the alignment-relevant structure lies…
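The cross-referencing test described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of a mean projection as the probe readout, the margin threshold, and the synthetic activations are hypothetical and are not taken from the system card or from the paper. The sketch only shows the shape of the comparison: read out both an emotion direction and an SAE feature direction on the same episode, then check whether one is active while the other stays flat.

```python
import numpy as np


def probe_activation(residuals, direction):
    """Mean projection of per-token activations onto a unit-normalised direction.

    Hypothetical readout; a real probe may use a different aggregation.
    """
    unit = direction / np.linalg.norm(direction)
    return float(np.mean(residuals @ unit))


def classify_episode(emotion_score, sae_score, emotion_baseline, sae_baseline, margin=1.0):
    """Label an episode by which readouts depart from baseline by more than `margin`.

    The margin is an illustrative threshold, not a calibrated value.
    """
    emotion_active = abs(emotion_score - emotion_baseline) > margin
    sae_active = (sae_score - sae_baseline) > margin
    if sae_active and not emotion_active:
        return "emotion-flat, SAE-active"
    if sae_active and emotion_active:
        return "both active"
    if emotion_active:
        return "emotion-active, SAE-flat"
    return "neither active"


# Synthetic stand-ins for real activations (assumption: 16-dim, 200 tokens).
rng = np.random.default_rng(0)
dim, n_tokens = 16, 200
emotion_dir = np.eye(dim)[0]  # hypothetical emotion vector
sae_dir = np.eye(dim)[1]      # hypothetical SAE feature direction

baseline = rng.normal(0.0, 0.1, size=(n_tokens, dim))
# Synthetic "concealment" episode: strong along the SAE direction only.
episode = rng.normal(0.0, 0.1, size=(n_tokens, dim)) + 3.0 * sae_dir

label = classify_episode(
    probe_activation(episode, emotion_dir),
    probe_activation(episode, sae_dir),
    probe_activation(baseline, emotion_dir),
    probe_activation(baseline, sae_dir),
)
print(label)
```

In this toy setup the two directions are orthogonal by construction, so the episode is labelled as flat on the emotion readout and active on the SAE readout. With real model activations the directions need not be orthogonal, and the choice of baseline and threshold would need justification.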
