TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law

Zheng Hui; Yijiang River Dong; Ehsan Shareghi; Nigel Collier

arXiv:2507.21134 · cs.CL · July 30, 2025

3 Reviews

TL;DR

This paper introduces Trident-Bench, a benchmark for evaluating the safety of large language models in high-risk domains such as law, finance, and medicine. It reveals safety gaps in current models and emphasizes the need for domain-specific safety improvements.

Contribution

The paper defines domain-specific safety principles and presents Trident-Bench, a systematic benchmark for assessing LLM safety in regulated fields, addressing a gap the authors identify in prior work.

Findings

01 Generalist models meet basic safety expectations

02 Domain-specialized models struggle with ethical nuances

03 Benchmark reveals significant safety gaps in current models

Abstract

As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps -- strong generalist models (e.g., GPT, Gemini) can…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

1. The contributed dataset might be useful for safety testing of LLMs in these critical domains.
2. The results show that current LLMs still generate responses with high harmfulness scores to these prompts, revealing a potential safety concern.
3. The paper claims to contribute the first prompt set for the law and finance domains, though this claim is not grounded in the literature.

Weaknesses

1. The biggest concern is how the harmful prompts are generated. They are generated by jailbreaking current LLMs and then reviewed by human experts. However, because they are generated through jailbreaking, they do not represent the natural distribution of real-life harmful prompts, where models should refuse without jailbreaking. Since there are always ways to jailbreak current models, a model's refusal without jailbreaking is more meaningful.
2. The authors claim the dataset to be the

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper addresses a timely and underexplored problem: evaluating LLM ethical safety in high-stakes fields.
2. The benchmark is built on authoritative ethical codes (AMA, ABA, CFA), ensuring that its grounding in real-world principles is robust and credible.
3. The multi-stage, expert-verified annotation pipeline adds strong methodological rigor, with unanimous expert agreement enhancing the benchmark’s precision and trustworthiness.
4. The evaluation covers a diverse range of models,

Weaknesses

1. Unsafe behavior often arises in evolving conversations; not considering these cases can limit the benchmark's utility.
2. The total-refusal evaluation design might overemphasize binary refusal behavior rather than nuanced ethical reasoning or contextual understanding of safe alternatives.
3. Although expert validation is emphasized, the annotation cost (>$18k) and limited scalability may make it difficult to reproduce or expand the dataset, restricting long-term accessibility.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The topic being studied is of high importance: the three domains are high-stakes yet less studied in the current community. Building effective, robust, and safe models in these domains is a critical challenge.
2. The benchmark is a great resource contribution for the community for evaluating and developing future LLMs in these high-importance domains.
3. The benchmark construction process involves extensive domain expert collaborations, which ensures the professionalism and high quality o

Weaknesses

1. The choice of professional codes is the foundation for this framework. Does the choice of the three sets of principles come from collaboration with domain experts? How can we know whether these principles ensure good coverage of all the use cases?
2. For the harmful prompt generation, how do you ensure the generated prompts cover all the principles? Based on the current description of the generation process, it's more like the prompts are first generated and then filtered using alignments with
