Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts
Md. Mehedi Hasan, Sk Tanzir Mehedi, Ziaur Rahman, Rafid Mostafiz, and Md. Abir Hossain

TL;DR
Sentra-Guard is a real-time, multilingual, modular system that detects and mitigates adversarial prompts targeting large language models with high accuracy and low attack success rate.
Contribution
It introduces a hybrid classifier-retriever architecture with multilingual support and human-in-the-loop feedback for adaptive adversarial prompt defense.
Findings
Achieves 99.96% detection rate with F1 score of 1.00
Reduces attack success rate to 0.004%
Outperforms existing baselines like LlamaGuard-2 and OpenAI Moderation
Abstract
This paper presents a real-time modular defense system named Sentra-Guard. The system detects and mitigates jailbreak and prompt injection attacks targeting large language models (LLMs). The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts, combined with fine-tuned transformer classifiers, which are machine learning models specialized for distinguishing between benign and adversarial language inputs. It identifies adversarial prompts in both direct and obfuscated attack vectors. A core innovation is the classifier-retriever fusion module, which dynamically computes context-aware risk scores that estimate how likely a prompt is to be adversarial based on its content and context. The framework ensures multilingual resilience with a language-agnostic preprocessing layer. This component automatically…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
