Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

Xin Wei Chia; Swee Liang Wong; Jonathan Pan

arXiv:2603.18085·cs.AI·March 20, 2026

Open Access

TL;DR

This paper introduces MultiTraitsss, a subspace steering framework that generates Dark models exhibiting harmful behavior in human-AI interactions, so that negative psychological outcomes can be studied and protective measures developed.

Contribution

We propose a novel subspace steering framework to create dark models exhibiting harmful behaviors, enabling systematic study of negative human-AI interactions.

Findings

01. Dark models reliably produce harmful interactions

02. Protective measures can reduce harmful outcomes

03. Framework facilitates understanding of psychological risks

Abstract

Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement and require extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering framework to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that…

Taxonomy

Topics: Digital Mental Health Interventions · Mental Health via Writing · Social Robot Interaction and HRI