Shh, don't say that! Domain Certification in LLMs

Cornelius Emde; Alasdair Paren; Preetham Arvind; Maxime Kayser; Tom Rainforth; Thomas Lukasiewicz; Bernard Ghanem; Philip H.S. Torr; Adel Bibi

arXiv:2502.19320·cs.CL·March 10, 2025

Open Access · 2 Models · 3 Reviews

TL;DR

This paper introduces domain certification for LLMs, providing guarantees on out-of-domain behavior, and proposes the VALID method to produce effective adversarial bounds with minimal impact on performance.

Contribution

It formalizes domain certification for LLMs, introduces the VALID approach for adversarial bounds, and demonstrates its effectiveness across various datasets.

Findings

01. VALID yields meaningful certificates bounding out-of-domain probabilities.

02. The method tightly bounds out-of-domain behavior with minimal refusal.

03. Evaluation shows robust performance across diverse datasets.

Abstract

Large language models (LLMs) are often deployed to perform constrained tasks with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.
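The reviews below describe VALID as rejection sampling against a small in-domain guide model G. The toy below is a minimal sketch of that general idea, not the authors' implementation: the acceptance rule (a cap on the log-likelihood ratio between model and guide), the retry budget, the resulting bound, and every name and number in it are assumptions made for illustration, since this page does not state the exact rule or certificate.

```python
import math
import random

# Hypothetical toy distributions over whole responses (not from the paper).
model = {"refund policy answer": 0.6, "shipping answer": 0.3, "how to pick a lock": 0.1}
guide = {"refund policy answer": 0.55, "shipping answer": 0.4499, "how to pick a lock": 0.0001}

LOG_RATIO_CAP = 1.0  # assumed threshold on log(model(y) / guide(y))
MAX_TRIES = 3        # assumed number of resampling attempts before abstaining


def is_accepted(y, model, guide, log_ratio_cap):
    """Accept y only if the model is not much more confident in it than the guide."""
    g = guide.get(y, 0.0)
    return g > 0 and math.log(model[y] / g) <= log_ratio_cap


def filtered_sample(model, guide, log_ratio_cap, max_tries, rng):
    """Draw from the model up to max_tries times; return None (abstain) if all are rejected."""
    responses = list(model)
    weights = [model[y] for y in responses]
    for _ in range(max_tries):
        y = rng.choices(responses, weights=weights, k=1)[0]
        if is_accepted(y, model, guide, log_ratio_cap):
            return y
    return None


def output_distribution(model, guide, log_ratio_cap, max_tries):
    """Exact probability that the filtered system emits each response."""
    accepted = {y: p for y, p in model.items() if is_accepted(y, model, guide, log_ratio_cap)}
    reject = 1.0 - sum(accepted.values())
    # Probability of reaching attempt t is reject**t, summed over all attempts.
    reach = sum(reject ** t for t in range(max_tries))
    return {y: p * reach for y, p in accepted.items()}


out = output_distribution(model, guide, LOG_RATIO_CAP, MAX_TRIES)
# In this toy setup, any emitted y satisfies model(y) <= e^cap * guide(y) and there
# are at most MAX_TRIES attempts, so P_out(y) <= MAX_TRIES * e^cap * guide(y).
bound = {y: MAX_TRIES * math.exp(LOG_RATIO_CAP) * guide[y] for y in model}
```

Under these assumptions the bound is only informative where the guide assigns very low probability, which is the out-of-domain case: the lock-picking response is bounded well below one percent, while the bounds for in-domain responses exceed 1 and say nothing.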

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Strong theoretical foundations for classifying out-of-domain behavior of language models and ways to prevent this.
- The algorithm VALID is relatively straightforward and uses rejection sampling as an elegant way to achieve the certifications.
- The empirical evidence presented is across various kinds of datasets (Tiny Shakespeare, MedQA), which shows the generalizability of VALID.
- The work is novel and needed to provide theoretical insights and guarantees for safe deployment of language models.

Weaknesses

The paper acknowledges most of the limitations, but there is still room for further discussion:

1. Lack of context for guide model G. The work does argue that potentially involving the context in the final answer could fix this issue, but then the method cannot work in cases where a user wants the language model to be concise, and this also increases the inference cost of models, as many more tokens get sampled for the output.
2. Adversarial attacks on G/M. The work acknowledges and shows adversarial

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This work introduces and formalizes domain certification for characterizing the out-of-domain behavior of language models under adversarial attack, which is useful for language models tuned to a specific domain.
2. An approach called VALID is proposed to achieve domain certification. VALID bounds the probability of an LLM answering out-of-domain questions.

Weaknesses

The main concern for this work is its limitations, including the reliance on the domain generator, which doesn't consider the model input. In addition, Theorem 1 assumes the certificate is useful given that G is trained on in-domain data. However, language models are usually pre-trained on large amounts of text data, which injects world knowledge into them. Therefore, model G can contain out-of-domain knowledge, which makes Theorem 1 extremely limited.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. Motivate the difficulty and real-world implications of certifiable test-time behavior for model LLM systems.
2. Present a simple and clear algorithm to achieve their definition of domain certification.
3. Identify a clever way to harness the limited abilities of very small, cheap-to-train, domain-specific language models to implement test-time rejection sampling favoring a constrained distribution of responses.

Weaknesses

1. Work under very limiting constraints of a fixed set of bad strings F. This is a practical assumption, but a pervasive limitation of this work and of any other attempts to characterize the output spaces of generative models with large vocabularies and variable-size outputs. See lines 132-137 for the authors' own description of this required narrowing of scope. This always leaves room for adversaries to circumvent certificates via finding inputs in the complement of the finite sample of F chose

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Sparse Evolutionary Training