Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan; Tao Ni; Zafar Ayyub Qazi; Marco Canini

arXiv:2602.08062·cs.LG·February 10, 2026

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

PDF

Open Access

TL;DR

BAGEL is a modular ensemble framework that efficiently detects malicious prompts in LLMs, adapting incrementally to new threats with high accuracy and interpretability, outperforming large-scale models in safety tasks.

Contribution

The paper introduces BAGEL, a lightweight, incrementally updatable ensemble system that improves malicious prompt detection in LLMs, balancing performance, efficiency, and adaptability.

Findings

01

BAGEL achieves an F1 score of 0.92 with only 5 ensemble members.

02

It outperforms OpenAI Moderation API and ShieldGemma in detection accuracy.

03

BAGEL maintains robustness after nine incremental updates.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques