An Affective-Taxis Hypothesis for Alignment and Interpretability
Eli Sennesh, Maxwell Ramstead

TL;DR
This paper introduces an affective-taxis framework for AI alignment, re-framing goals and values in terms of biologically inspired affective navigation, and discusses the role this framing could play in alignment and interpretability.
Contribution
It proposes a computational model of affect based on taxis navigation, drawing on evolutionary-developmental and computational neuroscience to re-frame goals and values for AI alignment.
Findings
The affect-based taxis model reflects aspects of biological taxis navigation.
Re-framing goals and values as affective taxis offers a new perspective on aligning AI with human values.
Evidence from a tractable model organism is discussed in support of the model.
Abstract
AI alignment is a field of research that aims to develop methods to ensure that agents always behave in a manner aligned with (i.e. consistently with) the goals and values of their human operators, no matter their level of capability. This paper proposes an affectivist approach to the alignment problem, re-framing the concepts of goals and values in terms of affective taxis, and explaining the emergence of affective valence by appealing to recent work in evolutionary-developmental and computational neuroscience. We review the state of the art and, building on this work, we propose a computational model of affect based on taxis navigation. We discuss evidence in a tractable model organism that our model reflects aspects of biological taxis navigation. We conclude with a discussion of the role of affective taxis in AI alignment.
