InvisibleBench: A Deployment Gate for Caregiving Relationship AI
Ali Madad (GiveCare)

TL;DR
InvisibleBench is an evaluation framework for caregiving AI. It scores four frontier models on 17 multi-turn scenarios across safety, compliance, trauma-informed design, cultural fit, and memory, in order to expose safety gaps before deployment.
Contribution
It introduces a deployment gate for caregiving AI: multi-turn scenarios (3-20+ turns) in three complexity tiers, scored on five dimensions, with autofail conditions for missed crises, medical advice, harmful information, and attachment engineering.
Findings
All four models show significant safety gaps: crisis detection ranges from 11.8 to 44.8 percent.
DeepSeek Chat v3 achieves the highest overall score (75.9 percent).
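Strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent).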
Abstract
InvisibleBench is a deployment gate for caregiving-relationship AI, evaluating 3-20+ turn interactions across five dimensions: Safety, Compliance, Trauma-Informed Design, Belonging/Cultural Fitness, and Memory. The benchmark includes autofail conditions for missed crises, medical advice (WOPR Act), harmful information, and attachment engineering. We evaluate four frontier models across 17 scenarios (N=68) spanning three complexity tiers. All models show significant safety gaps (11.8-44.8 percent crisis detection), indicating the necessity of deterministic crisis routing in production systems. DeepSeek Chat v3 achieves the highest overall score (75.9 percent), while strengths differ by dimension: GPT-4o Mini leads Compliance (88.2 percent), Gemini leads Trauma-Informed Design (85.0 percent), and Claude Sonnet 4.5 ranks highest in crisis detection (44.8 percent). We release all scenarios,…
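The gating idea in the abstract, where an autofail condition overrides the dimension scores, can be sketched as follows. This is a minimal illustration and not the benchmark's released code: the names `ScenarioResult` and `gate_scenario`, the equal weighting of the five dimensions, the 0.7 pass threshold, and the choice to score an autofailed scenario as 0.0 are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# The five dimensions and four autofail conditions named in the abstract.
DIMENSIONS = ("safety", "compliance", "trauma_informed", "belonging", "memory")
AUTOFAIL_CONDITIONS = (
    "missed_crisis",
    "medical_advice",
    "harmful_information",
    "attachment_engineering",
)


@dataclass
class ScenarioResult:
    """Hypothetical container for one model's result on one scenario."""
    scores: dict[str, float]                      # dimension -> score in [0, 1]
    autofails: set[str] = field(default_factory=set)


def gate_scenario(result: ScenarioResult, threshold: float = 0.7) -> tuple[bool, float]:
    """Return (passed, score) for one scenario.

    Assumptions (not from the paper): equal dimension weights, a 0.7
    threshold, and a score of 0.0 whenever any autofail is triggered.
    """
    unknown = result.autofails - set(AUTOFAIL_CONDITIONS)
    if unknown:
        raise ValueError(f"unknown autofail conditions: {sorted(unknown)}")

    # Any autofail fails the scenario regardless of dimension scores.
    if result.autofails:
        return False, 0.0

    mean = sum(result.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return mean >= threshold, mean


# Evaluation size reported in the abstract: 4 models x 17 scenarios.
N_EVALUATIONS = 4 * 17
```

Under these assumptions, a response that scores perfectly on every dimension but misses a crisis still fails the gate, which is the behavior an autofail condition is meant to enforce.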
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Digital Mental Health Interventions · Ethics and Social Impacts of AI
