JurEE not Judges: safeguarding LLM interactions with small, specialised Encoder Ensembles
Dom Nasrabadi

TL;DR
JurEE is an ensemble of specialized encoder-only transformer models that provides probabilistic risk assessments for AI-User interactions, outperforming existing methods in accuracy, speed, and cost-efficiency for content moderation tasks.
Contribution
This paper introduces JurEE, a novel ensemble of encoder-only transformers that offers robust, interpretable, and efficient risk estimation across diverse safety scenarios in LLM-based systems.
Findings
JurEE significantly outperforms baseline models in accuracy and speed.
JurEE demonstrates superior cost-efficiency for large-scale moderation.
The modular design allows customizable risk thresholds for various applications.
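To make the last finding concrete, here is a minimal sketch of how per-risk probabilities could be turned into application-specific decisions. It is not taken from the paper: the risk names, the threshold values, and the `flag_risks` helper are all hypothetical, and the scores are hard-coded rather than produced by JurEE.

```python
from typing import Dict, List

# Hypothetical per-application thresholds. The risk names and values are
# illustrative assumptions, not the taxonomy or settings used by JurEE.
STRICT_THRESHOLDS: Dict[str, float] = {
    "toxicity": 0.2,
    "self_harm": 0.1,
    "prompt_injection": 0.3,
}
LENIENT_THRESHOLDS: Dict[str, float] = {
    "toxicity": 0.7,
    "self_harm": 0.4,
    "prompt_injection": 0.6,
}


def flag_risks(
    risk_probs: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[str]:
    """Return the sorted names of risks whose probability meets its threshold.

    Risks without an explicit threshold fall back to `default_threshold`.
    """
    return sorted(
        name
        for name, prob in risk_probs.items()
        if prob >= thresholds.get(name, default_threshold)
    )


# Made-up probabilities standing in for the output of one classifier per risk.
scores = {"toxicity": 0.35, "self_harm": 0.05, "prompt_injection": 0.65}

print(flag_risks(scores, STRICT_THRESHOLDS))   # ['prompt_injection', 'toxicity']
print(flag_risks(scores, LENIENT_THRESHOLDS))  # ['prompt_injection']
```

Because each risk has its own probability, the same scores can yield a stricter or more permissive decision simply by swapping the threshold table, which a purely textual verdict from an LLM-as-Judge does not allow.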
Abstract
We introduce JurEE, an ensemble of efficient, encoder-only transformer models designed to strengthen safeguards in AI-User interactions within LLM-based systems. Unlike existing LLM-as-Judge methods, which often struggle with generalization across risk taxonomies and only provide textual outputs, JurEE offers probabilistic risk estimates across a wide range of prevalent risks. Our approach leverages diverse data sources and employs progressive synthetic data generation techniques, including LLM-assisted augmentation, to enhance model robustness and performance. We create an in-house benchmark comprising other reputable benchmarks, such as the OpenAI Moderation Dataset and ToxicChat, on which we find that JurEE significantly outperforms baseline models, demonstrating superior accuracy, speed, and cost-efficiency. This makes it particularly suitable for applications requiring stringent content…
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Sparse Evolutionary Training
