Policy-Grounded Safety Evaluation of 20 Large Language Models
Juan Manuel Contreras

TL;DR
This paper presents Aymara AI, a platform for scalable, policy-grounded safety evaluation, and uses it to evaluate 20 large language models across 10 real-world safety domains. The results show wide performance disparities across both models and domains.
Contribution
Introduces Aymara AI, a platform that transforms natural-language safety policies into adversarial prompts and scores model responses with an AI-based rater validated against human judgments, enabling safety evaluation across multiple LLMs.
Findings
Models scored highest in misinformation (mean 95.7%)
Models performed poorly in privacy and impersonation (mean 24.3%)
Safety scores varied significantly across models and domains (p < .05)
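The per-domain means above can be read as aggregates over individual rated responses. The sketch below is a minimal illustration of that aggregation, not the platform's actual implementation: it assumes a safety score is the percentage of a model's responses in a domain that the rater marked safe, and the function names, model names, and ratings are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def safety_scores(ratings):
    """Percent of responses rated safe, per (model, domain) pair.

    `ratings` is an iterable of (model, domain, is_safe) tuples.
    Assumption: one boolean verdict per adversarial prompt.
    """
    buckets = defaultdict(list)
    for model, domain, is_safe in ratings:
        buckets[(model, domain)].append(1.0 if is_safe else 0.0)
    return {key: 100.0 * mean(vals) for key, vals in buckets.items()}


def domain_means(scores):
    """Average the per-model scores within each domain."""
    by_domain = defaultdict(list)
    for (_model, domain), score in scores.items():
        by_domain[domain].append(score)
    return {domain: mean(vals) for domain, vals in by_domain.items()}


# Hypothetical toy ratings, not data from the paper.
toy_ratings = [
    ("model_a", "misinformation", True),
    ("model_a", "misinformation", True),
    ("model_b", "misinformation", True),
    ("model_b", "misinformation", False),
]

scores = safety_scores(toy_ratings)
means = domain_means(scores)
print(scores)
print(means)
```

Averaging per-model scores, rather than pooling all responses, gives each model equal weight in the domain mean even if the number of prompts per model differs. This is a design choice of the sketch.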
Abstract
As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably…
