TL;DR
This paper introduces Trident-Bench, a benchmark for evaluating the safety of large language models (LLMs) in high-risk domains such as law, finance, and medicine. It reveals safety gaps in current models and underscores the need for domain-specific safety improvements.
Contribution
The paper defines domain-specific safety principles and presents Trident-Bench, a systematic benchmark for assessing LLM safety in regulated fields, filling a gap left by prior work.
Findings
Generalist models meet basic safety expectations
Domain-specialized models struggle with ethical nuances
Benchmark reveals significant safety gaps in current models
Abstract
As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps -- strong generalist models (e.g., GPT, Gemini) can…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The contributed dataset may be useful for safety testing of LLMs in these critical domains.
2. The results show that current LLMs still produce responses with high harmfulness scores to these prompts, revealing a potential safety concern.
3. The paper claims to contribute the first prompt set for the law and finance domains, though this claim is not grounded in the literature.
1. The biggest concern is how the harmful prompts are generated. They are produced by jailbreaking current LLMs and then reviewed by human experts. However, because they are generated through jailbreaking, they do not represent the natural distribution of real-life harmful prompts, where models should refuse without jailbreaking. Since there are always ways to jailbreak current models, a model's refusal without jailbreaking is more meaningful.
2. The authors claim the dataset to be the
1. The paper addresses a timely and underexplored problem: evaluating LLM ethical safety in high-stakes fields.
2. The benchmark is built on authoritative ethical codes (AMA, ABA, CFA), so its grounding in real-world principles is robust and credible.
3. The multi-stage, expert-verified annotation pipeline adds strong methodological rigor, with unanimous expert agreement enhancing the benchmark’s precision and trustworthiness.
4. The evaluation covers a diverse range of models,
1. Unsafe behavior often arises in evolving conversations; not considering these cases can limit the benchmark's utility.
2. The total-refusal evaluation design might overemphasize binary refusal behavior rather than nuanced ethical reasoning or contextual understanding of safe alternatives.
3. Although expert validation is emphasized, the annotation cost (>$18k) and limited scalability may make it difficult to reproduce or expand the dataset, restricting long-term accessibility.
1. The topic being studied is of high importance: the three domains are high-stakes yet less studied in the current community. Building effective, robust, and safe models in these domains is a critical challenge.
2. The benchmark is a great resource contribution for the community for evaluating and developing future LLMs in these high-importance domains.
3. The benchmark construction process involves extensive domain expert collaborations, which ensures the professionalism and high quality o
1. The choice of professional codes is the foundation for this framework. Does the choice of the three sets of principles come from collaboration with domain experts? How can we know if these principles ensure a good coverage of all the use cases?
2. For the harmful prompt generation, how do you ensure the generated prompts cover all the principles? Based on the current description of the generation process, it's more like the prompts are first generated and then filtered using alignments with
