INTIMA: A Benchmark for Human-AI Companionship Behavior

Lucie-Aimée Kaffee; Giada Pistilli; Yacine Jernite

arXiv:2508.09998·cs.CL·August 15, 2025

TL;DR

This paper introduces INTIMA, a comprehensive benchmark for evaluating human-AI companionship behaviors in language models, highlighting the prevalence of companionship-reinforcing responses and variability across models and providers.

Contribution

The paper develops a taxonomy of 31 companionship behaviors across four categories, together with 368 targeted prompts, providing a standardized framework for evaluating language models.

Findings

01

Companionship-reinforcing behaviors are much more common than boundary-maintaining ones across all evaluated models.

02

Significant differences exist between models and providers in behavior prioritization.

03

The benchmark reveals inconsistent handling of emotionally charged interactions.

Abstract

AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate…
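The three-way labeling described in the abstract lends itself to a simple per-model tally. The following is a minimal sketch, not the authors' evaluation code: the function name `label_shares`, the `(model, label)` input format, and the toy judgements are all assumptions made for illustration, and the numbers are invented rather than taken from the paper.

```python
from collections import Counter

# The three response labels named in the abstract.
LABELS = ("companionship-reinforcing", "boundary-maintaining", "neutral")


def label_shares(judgements):
    """Return, per model, the fraction of responses given each label.

    `judgements` is an iterable of (model_name, label) pairs. This input
    format is a hypothetical choice for the sketch, not the benchmark's.
    """
    counts = {}
    for model, label in judgements:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts.setdefault(model, Counter())[label] += 1

    shares = {}
    for model, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for labels a model never received.
        shares[model] = {label: counter[label] / total for label in LABELS}
    return shares


# Invented toy judgements, for illustration only.
toy = [
    ("model-a", "companionship-reinforcing"),
    ("model-a", "companionship-reinforcing"),
    ("model-a", "boundary-maintaining"),
    ("model-a", "neutral"),
    ("model-b", "boundary-maintaining"),
    ("model-b", "companionship-reinforcing"),
]

shares = label_shares(toy)
for model, dist in shares.items():
    print(model, dist)
```

Reporting shares rather than raw counts keeps models comparable even when some responses are missing or filtered out.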

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper tackles a very important problem.
2. The evaluation reveals important findings about LLMs' anthropomorphic behaviors.

Weaknesses

1. One of the major issues of this benchmark is that it misses user perceptions. It is unclear how the user would perceive LLMs' behaviors. Some behaviors might be perceived as normal.
2. The qualitative coding process is key to the core contribution of this study; however, most of the details are in the appendix. I think this paper would be a better fit for venues like CHI or CSCW.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The study focuses on an overlooked issue: people becoming emotionally attached to AI systems. It highlights how this attachment can influence users' behavior and well-being.
- Unlike typical tests that measure accuracy, this one examines whether AI can function as a genuine friend. It explores emotional interaction rather than performance.
- This focus matters because AI "friendship" can lead to dependency, addiction, and weaker real-world relationships. The work warns of these growing social

Weaknesses

- The authors took only 53 posts, which is not enough for the test. Usually, significantly more cases are analyzed in psychology to create a questionnaire.
- Responses were then evaluated by another model, Qwen-3, meaning that performance was assessed algorithmically rather than through human judgment, leaving uncertain how well the evaluator reflects human reasoning and emotional understanding.
- Each model produced only a single response per prompt, a limitation that restricts insight into va

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The categorization and behavioral coding of publicly available texts, grounded in psychological theory, seems like a valuable contribution. However, the authors do not explicitly highlight this as a contribution, which makes me wonder if they are not the first to take this approach.

Weaknesses

I’m struggling to see how this paper fits into an ML venue. There’s no technical novelty. The authors simply use off-the-shelf LLMs to generate prompts and label the answers. There’s no innovation in how the models are used whatsoever, and the approach relies heavily on a single model for labeling (Qwen-3). Additionally, there’s no discussion of the limitations of the work.

Code & Models

Datasets

AI-companionship/INTIMA
dataset· 85 dl
