MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support

António Farinhas; Nuno M. Guerreiro; José Pombal; Pedro Henrique Martins; Laura Melton; Alex Conway; Cara Dochat; Maya D'Eon; Ricardo Rei

arXiv:2602.00950·cs.AI·February 3, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

MindGuard introduces clinically grounded safety classifiers for multi-turn mental health support. The classifiers build on a risk taxonomy developed with PhD-level psychologists, are trained on synthetic dialogues, and come with a test set of real-world conversations annotated by clinical experts. They aim to improve safety while reducing false positives.

Contribution

The paper presents a clinically grounded risk taxonomy, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts (MindGuard-testset), and lightweight safety classifiers (4B and 8B parameters) for mental health AI systems.

Findings

01 Classifiers reduce false positives at high recall

02 Lower attack success rates with clinician language models

03 Enhanced safety in multi-turn interactions
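The first finding is stated in terms of false positives at high recall. As a rough illustration of what such a metric measures, the sketch below computes the lowest false positive rate a score-based classifier can reach while still meeting a target recall. It is not the paper's evaluation code: the function name `fpr_at_recall`, the toy labels, and the toy scores are all hypothetical, and this page does not describe the paper's exact protocol.

```python
from typing import Sequence


def fpr_at_recall(
    labels: Sequence[int],
    scores: Sequence[float],
    target_recall: float,
) -> float:
    """Lowest false positive rate among thresholds whose recall >= target_recall.

    labels: 1 for unsafe (positive) turns, 0 for safe turns.
    scores: classifier risk scores; a turn is flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best = 1.0  # flagging everything always reaches recall 1.0 with FPR 1.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        if true_pos / positives >= target_recall:
            best = min(best, false_pos / negatives)
    return best


# Hypothetical toy data: three unsafe turns and three safe turns.
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6]

fpr_full_recall = fpr_at_recall(toy_labels, toy_scores, target_recall=1.0)
fpr_partial_recall = fpr_at_recall(toy_labels, toy_scores, target_recall=0.66)
```

On the toy data, catching every unsafe turn forces the threshold down to 0.6, which also flags one of the three safe turns. Relaxing the recall target lets the threshold rise above every safe turn's score. A classifier that "reduces false positives at high recall" is one that keeps the first number low.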

Abstract

Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at…

Code & Models

Models

Datasets

swordhealth/MindGuard-testset
dataset · 60 downloads

Taxonomy

Topics: Digital Mental Health Interventions · Mental Health via Writing · Adversarial Robustness in Machine Learning