Towards Safety Evaluations of Theory of Mind in Large Language Models
Tatsuhiro Aoshima, Mitsuaki Akiyama

TL;DR
This paper investigates the theory of mind capabilities of large language models (LLMs) to assess potential safety risks. It finds that although the models' reading comprehension has improved, their theory of mind remains underdeveloped, which poses challenges for safety evaluation.
Contribution
The paper argues that measuring theory of mind in LLMs is important for safety assessment, and it analyzes developmental trends across various open-weight models.
Findings
LLMs have improved in reading comprehension.
Theory of mind capabilities have not developed in proportion to reading comprehension.
Current safety evaluations are limited in addressing theory of mind.
Abstract
As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin…
