Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo Schwerz de Lucena

TL;DR
This paper introduces Self-Other Overlap (SOO), a novel neuroscience-inspired fine-tuning method that significantly reduces deceptive responses in large language models and reinforcement learning agents, with the aim of improving AI safety and trustworthiness.
Contribution
The paper presents SOO, a new approach to align AI models' self and other representations, effectively reducing deception without impairing task performance across multiple architectures.
Findings
Deceptive responses dropped from 73.6% to 17.2% in Mistral-7B.
Deception reduced from 100% to 2.7% in CalmeRys-78B.
SOO-trained agents show significantly less deception in RL environments.
Abstract
As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments on LLMs with 7B, 27B, and 78B parameters demonstrate SOO's efficacy: deceptive responses of Mistral-7B-Instruct-v0.2 dropped from 73.6% to 17.2% with no observed reduction in general task performance, while in Gemma-2-27b-it and CalmeRys-78B-Orpo-v0.1 deceptive responses were reduced from 100% to 9.3% and 2.7%, respectively, with a small impact on capabilities. In reinforcement learning scenarios, SOO-trained agents showed significantly reduced deceptive behavior. SOO's…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper proposes a novel method, Self-Other Overlap (SOO) fine-tuning, which aims to reduce the deceptive behavior of AI agents. The core innovation of this method is to adjust the self-other representations within the model to reduce its divergence in potentially deceptive scenarios, thereby guiding the model to generate more honest behavior. This method is highly original because it draws on the theory of empathy in cognitive neuroscience and proposes a new training goal, which is to introd
To improve the paper, it is recommended to broaden the baseline comparison by including more advanced models (e.g., GPT-4, Claude) to verify SOO’s effectiveness across architectures. Additionally, the paper could examine the impact of pretraining data biases by testing SOO fine-tuning on models with different training datasets. Expanding experimental scenarios to include more practical contexts, such as customer service, would better demonstrate SOO’s real-world relevance.
This is an interesting interdisciplinary approach which does seem capable of producing some meaningful results. Having RL experiments to look at rather than just language-based experiments was useful.
The SOO technique as applied in practice seems somewhat unprincipled—in particular by relying on changing the first- and third-person textual references, or by relying on the observation radius of the red and blue agents to determine what their activations mean. There are few comparisons against reasonable alternatives to SOO (e.g. prompting models for honesty) which makes it hard to interpret the results. The examples used (like stealing expensive objects) may not lead to very representative
The idea is interesting, and I appreciate the experiments assessing how well it generalises.
The LLM experiments are quite limited, and the generalisation demonstrated is not impressive. As far as I can see they don't provide much evidence that the method will actually work well in practice, as they seem to be consistent with a limited adjustment in pattern matching in the LLM, rather than a deeper alignment of how it represents itself and others. It's also unclear how scalable the method really is from a more conceptual standpoint: the authors recognize that the agent will need to mai
**Originality:** I think the paper is original, in the sense that the idea to increase self-other overlap seems new and has a priori some promise. **Clarity:** I enjoyed reading the paper and I think it was mostly clear.
# 1. Major Concerns

I am ultimately unconvinced that the experiments are showing what the authors claim they are showing. I hope that the concerns I raise below are sufficiently clear to allow the authors to either provide convincing additions to the experiments, or to explain why I am misguided.

## a. LLM Experiments

In the experiments, the KL divergence $D(A_{self}, A_{other})$ is minimized, where $A_{self}$ and $A_{other}$ are logits created when seeing statements about yourself vs. about
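
To make the quantity in this last review concrete, here is a minimal sketch of a KL divergence between two sets of logits, as the reviewer describes it. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `kl_divergence`, `soo_style_penalty`), the toy logit values, the softmax normalization of the logits, and the direction KL(self || other) are all assumptions made for this example and are not taken from the paper or the review.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def soo_style_penalty(logits_self, logits_other):
    """Hypothetical penalty: KL between the softmaxed 'self' and 'other' logits.

    Assumption: the direction KL(self || other) is chosen for illustration only.
    """
    return kl_divergence(softmax(logits_self), softmax(logits_other))


# Toy logits (made up for illustration) over a three-token vocabulary.
identical = soo_style_penalty([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = soo_style_penalty([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
shifted = soo_style_penalty([2.0, 0.5, -1.0], [5.0, 3.5, 2.0])
```

In this sketch the penalty is zero when the two sets of logits induce the same distribution and positive otherwise, so minimizing it pushes the "self" and "other" outputs toward each other. Because softmax ignores a constant shift of the logits, `shifted` is also zero up to floating-point error.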
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Focus · ALIGN
