Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment

Joshua T. S. Hewson

arXiv:2411.04127·cs.AI·November 8, 2024

Open Access

TL;DR

This paper introduces a novel approach combining Theory of Mind and kindness to improve self-supervised human-AI alignment, aiming to enhance safety, social intelligence, and value understanding in AI systems.

Contribution

It proposes a new human-inspired framework that integrates Theory of Mind and kindness to better align AI with human values and intentions.

Findings

01. Enhanced understanding of human mental states
02. Improved safety and alignment in AI behaviors
03. Potential reduction in manipulation and harmful actions

Abstract

As artificial intelligence (AI) becomes deeply integrated into critical infrastructures and everyday life, ensuring its safe deployment is one of humanity's most urgent challenges. Current AI models prioritize task optimization over safety, leading to risks of unintended harm. These risks are difficult to address due to the competing interests of governments, businesses, and advocacy groups, all of which have different priorities in the AI race. Current alignment methods, such as reinforcement learning from human feedback (RLHF), focus on extrinsic behaviors without instilling a genuine understanding of human values. These models are vulnerable to manipulation and lack the social intelligence necessary to infer the mental states and intentions of others, raising concerns about their ability to safely and responsibly make important decisions in complex and novel situations. Furthermore,…

Taxonomy

Topics: Social Robot Interaction and HRI · Ethics and Social Impacts of AI · Technology and Human Factors in Education and Health

Methods: ALIGN · Focus