Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration
Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze

TL;DR
This paper presents Archias, an expert model that enhances large language models' security and domain-specific accuracy by classifying and filtering user inputs to prevent jailbreaks and prompt injections, especially in the automotive industry.
Contribution
Introduction of Archias, a small, customizable expert model that improves LLM security and domain relevance by classifying user queries and integrating outputs into prompts.
Findings
Archias effectively detects malicious and out-of-domain queries.
Integration of Archias improves LLM response safety and relevance.
Benchmark dataset for automotive industry created and released.
Abstract
Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient, as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
