Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation
Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

TL;DR
BAGEL is a modular ensemble framework that efficiently detects malicious prompts in LLMs, adapting incrementally to new threats with high accuracy and interpretability, outperforming large-scale models in safety tasks.
Contribution
The paper introduces BAGEL, a lightweight, incrementally updatable ensemble system that improves malicious prompt detection in LLMs, balancing performance, efficiency, and adaptability.
Findings
BAGEL achieves an F1 score of 0.92 with only 5 ensemble members.
It outperforms OpenAI Moderation API and ShieldGemma in detection accuracy.
BAGEL maintains robustness after nine incremental updates.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
