ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi; Dongjin Kim; Seungbin Yang; Subin Kim; Youngjun Kwak; Juyoung Oh; Jaegul Choo; Jungmin Son

arXiv:2603.02588·cs.CL·March 4, 2026

Open Access·3 Reviews

TL;DR

ExpGuard is a specialized safety guardrail for large language models that moderates domain-specific content in the financial, medical, and legal sectors. It is accompanied by a new dataset, ExpGuardMix, for training and evaluation.

Contribution

We introduce ExpGuard, a domain-specific guardrail model, together with a dataset, to improve the safety and robustness of LLMs against technical and adversarial content in specialized fields.

Findings

01. ExpGuard outperforms existing models in domain-specific prompt and response classification.

02. The dataset ExpGuardMix enables effective training and evaluation of specialized guardrails.

03. ExpGuard demonstrates high resilience to adversarial attacks in targeted domains.

Abstract

With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, rendering LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts paired with corresponding refusal and compliant responses, from these specific sectors. This dataset is divided into two subsets:…

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The proposed dataset is of high quality and can promote future research on LLM safety in specific domains.
2. The experimental results provide some insights into the task of LLM safety in specific domains.
3. The paper is well written and easy to follow.

Weaknesses

1. The contribution of this paper falls mainly on the resource side. However, in my view, the technical contribution of the dataset construction method is somewhat limited. The authors devote considerable space (i.e., Sections 3.1 and 3.2) to the details of data construction, labeling, and filtering, whose technical design is comprehensive but not very novel.
2. The multi-task training method of EXPGUARD (Section 3.3) seems direct and intuitive, which is similar to the models for LLM safety of ge

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The paper studies moderation problems in specific domains such as finance, healthcare, and law, which are not studied extensively by popular methods.
3. The paper attaches its implementation and code, which helps with reproducing the work.
4. The comparison is extensive, covering both the ExpGuard test set and public test sets.

Weaknesses

1. The paper mainly focuses on terminology in specific domains. Tackling such problems requires extensive effort, such as crawling Wikipedia to collect all of the terms. The whole generation and training process seems very resource-intensive.

Reviewer 03·Rating 4·Confidence 4

Strengths

- The data construction pipeline for combining automated term mining, GPT-based generation, ensemble CoT-reasoning labelling, and expert verification is automatic and replicable for other domains.
- The experimental evaluation is comprehensive, spanning both specialized and public benchmarks, with clear ablation studies showing the effect of each dataset component.
- The paper is well-structured, with clear figures illustrating data composition, pipeline, and example prompts.

Weaknesses

- The domain overlap with existing datasets such as WildGuardMix is unknown. Even though this paper focuses on specialized domains, it is hard to tell how different the proposed dataset is from existing datasets. Other guardrail models such as WildGuard and LlamaGuard can already achieve F1 scores over 70% or 80%. This suggests that EXPGUARDMIX could only serve as a supplement to the existing data. Without the quantitative measurement of the overlapped data (both benign and harmful), it is h
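
The quantitative overlap measurement this review asks for could be approximated along the following lines. This is a minimal sketch, not the authors' method: it assumes word-level trigram Jaccard similarity as the overlap metric and a threshold of 0.5 for counting a near-duplicate, and the function names (`ngrams`, `jaccard`, `overlap_rate`) are hypothetical. It reports the fraction of prompts in one dataset that have a near-duplicate in a reference dataset.

```python
def ngrams(text, n=3):
    """Return the set of word-level n-grams of a prompt (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short prompts: fall back to a single gram holding all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_rate(new_prompts, reference_prompts, threshold=0.5, n=3):
    """Fraction of new_prompts with a near-duplicate in reference_prompts.

    threshold and n are illustrative assumptions, not values from the paper.
    """
    if not new_prompts:
        return 0.0
    reference_sets = [ngrams(p, n) for p in reference_prompts]
    hits = 0
    for prompt in new_prompts:
        grams = ngrams(prompt, n)
        if any(jaccard(grams, ref) >= threshold for ref in reference_sets):
            hits += 1
    return hits / len(new_prompts)
```

Running `overlap_rate` separately on the benign and harmful subsets would give the per-class breakdown the reviewer mentions. The pairwise comparison is quadratic in dataset size, so a corpus of tens of thousands of prompts would likely need an approximate method such as MinHash instead.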

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling