Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Jia Yi Goh; Shaun Khoo; Nyx Iskandar; Gabriel Chua; Leanne Tan; Jessica Foo

arXiv:2507.09820·cs.SE·July 15, 2025

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo

PDF

Open Access

TL;DR

This paper presents a practical framework for evaluating safety risks at the application level of LLM systems, emphasizing real-world deployment and operational safety considerations.

Contribution

It introduces a novel, actionable framework for assessing safety risks in LLM applications, bridging theoretical safety concepts with practical deployment challenges.

Findings

01

Framework validated through real-world deployment

02

Guidelines for developing safety risk taxonomies

03

Enhanced safety evaluation practices for LLM applications

Abstract

Most safety testing efforts for large language models (LLMs) today focus on evaluating foundation models. However, there is a growing need to evaluate safety at the application level, as components such as system prompts, retrieval pipelines, and guardrails introduce additional factors that significantly influence the overall safety of LLM applications. In this paper, we introduce a practical framework for evaluating application-level safety in LLM systems, validated through real-world deployment across multiple use cases within our organization. The framework consists of two parts: (1) principles for developing customized safety risk taxonomies, and (2) practices for evaluating safety risks in LLM applications. We illustrate how the proposed framework was applied in our internal pilot, providing a reference point for organizations seeking to scale their safety testing efforts. This…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSafety Systems Engineering in Autonomy · Risk and Safety Analysis · Software Reliability and Analysis Research