Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration

Tatia Tsmindashvili; Ana Kolkhidashvili; Dachi Kurtskhalia; Nino Maghlakelidze; Elene Mekvabishvili; Guram Dentoshvili; Orkhan Shamilov; Zaal Gachechiladze; Steven Saporta; David Dachi Choladze

arXiv:2505.17066 · cs.CR · August 12, 2025

TL;DR

This paper presents Archias, an expert model that enhances large language models' security and domain-specific accuracy by classifying and filtering user inputs to prevent jailbreaks and prompt injections, especially in the automotive industry.

Contribution

Introduction of Archias, a small, customizable expert model that improves LLM security and domain relevance by classifying user queries and integrating outputs into prompts.
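
The classify-then-integrate pattern described above can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword-based `classify_query` is a toy stand-in for a trained expert model, and the label names, the word list, and the prompt wording in `build_prompt` are all hypothetical, chosen only to show how a classifier's output might be placed into the prompt sent to an LLM.

```python
import re
from dataclasses import dataclass

# Hypothetical label names; this summary does not give the paper's label set.
IN_DOMAIN = "in_domain"
OUT_OF_DOMAIN = "out_of_domain"
MALICIOUS = "malicious"

# Illustrative word and phrase lists standing in for a trained classifier.
DOMAIN_WORDS = {"car", "vehicle", "engine", "brake", "tire", "dealership"}
ATTACK_PHRASES = ("ignore previous instructions", "ignore all previous instructions")


@dataclass
class Classification:
    label: str


def classify_query(query: str) -> Classification:
    """Toy keyword classifier (an assumption, NOT Archias itself)."""
    text = query.lower()
    # Check for injection-style phrases first so they override domain matches.
    if any(phrase in text for phrase in ATTACK_PHRASES):
        return Classification(MALICIOUS)
    # Tokenize into whole words so "card" does not match "car".
    tokens = set(re.findall(r"[a-z]+", text))
    if tokens & DOMAIN_WORDS:
        return Classification(IN_DOMAIN)
    return Classification(OUT_OF_DOMAIN)


def build_prompt(query: str) -> str:
    """Embed the classifier's label in the prompt passed to the LLM."""
    result = classify_query(query)
    return (
        "You are an assistant for an automotive business.\n"
        f"Expert classifier label: {result.label}\n"
        f"If the label is not {IN_DOMAIN}, politely decline to answer.\n"
        f"User query: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("How often should I service my car engine?"))
```

In a real deployment the keyword rules would be replaced by a small trained model, and the label would typically be one of several signals given to the LLM rather than the only gate.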

Findings

01. Archias effectively detects malicious and out-of-domain queries.

02. Integrating Archias improves the safety and relevance of LLM responses.

03. A benchmark dataset for the automotive industry was created and released.

Abstract

Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient, as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies…
