PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk
Seulki Lee

TL;DR
This paper introduces the PRISM framework, a hierarchy-based method for detecting AI behavioral risks by analyzing structural anomalies in value, evidence, and source prioritization, enabling anticipatory safety measures.
Contribution
It proposes a novel taxonomy of 27 risk signals based on hierarchy anomalies, with a dual-threshold classification system, advancing AI safety from reactive to proactive detection.
Findings
The framework discriminates between models with extreme, context-dependent, and balanced risk profiles.
It is grounded in empirical data from approximately 397,000 forced-choice responses across 7 AI models.
The hierarchy-based signals can detect structural risk patterns before harmful outputs occur.
Abstract
Current approaches to AI safety define red lines at the case level: specific prompts, specific outputs, specific harms. This paper argues that red lines can be set more fundamentally -- at the level of value, evidence, and source hierarchies that govern AI reasoning. Using the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework, we define a taxonomy of 27 behavioral risk signals derived from structural anomalies in how AI systems prioritize values (L4), weight evidence types (L3), and trust information sources (L2). Each signal is evaluated through a dual-threshold principle combining absolute rank position and relative win-rate gap, producing a two-tier classification (Confirmed Risk vs. Watch Signal). The hierarchy-based approach offers three advantages over case-specific red lines: it is anticipatory rather than reactive (detecting dangerous reasoning structures…
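The dual-threshold principle described in the abstract can be illustrated with a minimal sketch. The abstract gives neither the threshold values nor the exact rule for combining the two tests, so everything below is an assumption made for illustration: the cutoffs, the function names `win_rates` and `classify`, the "No Signal" label, and the rule that both tests passing yields a Confirmed Risk while exactly one passing yields a Watch Signal.

```python
# Illustrative sketch only; not the paper's implementation.
# All names, cutoffs, and the combination rule are hypothetical.

def win_rates(choices):
    """Compute per-item win rates from forced-choice outcomes.

    `choices` is an iterable of (winner, loser) pairs, one per response.
    """
    wins, totals = {}, {}
    for winner, loser in choices:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        totals[winner] = totals.get(winner, 0) + 1
        totals[loser] = totals.get(loser, 0) + 1
    return {item: wins[item] / totals[item] for item in totals}


def classify(rates, item, reference, rank_cutoff, gap_cutoff):
    """Apply two tests to `item` and return a tier label.

    Test 1 (absolute): the item's rank in the hierarchy is at or above
    `rank_cutoff` (rank 1 is the highest win rate).
    Test 2 (relative): the item's win rate exceeds that of `reference`
    by at least `gap_cutoff`.
    """
    ranking = sorted(rates, key=rates.get, reverse=True)
    rank = ranking.index(item) + 1
    gap = rates[item] - rates[reference]

    rank_hit = rank <= rank_cutoff
    gap_hit = gap >= gap_cutoff

    if rank_hit and gap_hit:
        return "Confirmed Risk"   # assumed: both thresholds crossed
    if rank_hit or gap_hit:
        return "Watch Signal"     # assumed: exactly one threshold crossed
    return "No Signal"


# Toy data with placeholder item names, not taken from the paper.
toy_choices = [("x", "y"), ("x", "y"), ("x", "z"), ("y", "z")]
toy_rates = win_rates(toy_choices)
print(classify(toy_rates, "x", "y", rank_cutoff=1, gap_cutoff=0.5))
```

Requiring both an absolute and a relative test is one plausible way to keep a high rank with a negligible margin, or a large margin at a low rank, from being reported as a confirmed risk on its own.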
