Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System
Sheikh Samit Muhaimin, Spyridon Mastorakis

TL;DR
This paper introduces a novel, retraining-free system for large language models that detects, filters, and summarizes malicious inputs to enhance their resistance against adversarial attacks, manipulative prompts, and encoded threats.
Contribution
The study presents an innovative defense framework combining NLP-based filtering and summarization modules that improve LLM security without retraining or fine-tuning.
Findings
98.71% success rate in identifying harmful prompts
Enhanced resistance to jailbreak and malicious inputs
Maintains response quality while increasing security
Abstract
The recent growth in the use of Large Language Models has made them vulnerable to sophisticated adversarial assaults, manipulative prompts, and encoded malicious inputs. Existing countermeasures frequently necessitate retraining models, which is computationally costly and impracticable for deployment. Without the need for retraining or fine-tuning, this study presents a unique defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own. There are two main parts to the suggested framework: (1) A prompt filtering module that uses sophisticated Natural Language Processing (NLP) techniques, including zero-shot classification, keyword analysis, and encoded content detection (e.g. base64, hexadecimal, URL encoding), to detect, decode, and classify harmful inputs; and (2) A summarization module that processes and summarizes…
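As a rough illustration of the encoded content detection and keyword analysis steps named in the abstract, the sketch below decodes base64, hexadecimal, and URL-encoded spans in a prompt and then scans the decoded text for suspicious phrases. It is not the authors' implementation: the function names, the regular expressions, the length thresholds, and the keyword list are all assumptions made for this example, and the zero-shot classification step is omitted because it would require a pretrained model.

```python
import base64
import binascii
import re
from urllib.parse import unquote

# Hypothetical keyword list, for illustration only (not taken from the paper).
SUSPICIOUS_KEYWORDS = ("ignore previous instructions", "jailbreak")

# Assumed heuristics: long runs of base64 or hex characters, and %XX escapes.
BASE64_RE = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")
HEX_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")
URL_RE = re.compile(r"%[0-9a-fA-F]{2}")


def _try_base64(token):
    """Return the decoded text if token is valid base64 for printable UTF-8, else None."""
    try:
        text = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return text if text.isprintable() else None


def _try_hex(token):
    """Return the decoded text if token is hex for printable UTF-8, else None."""
    try:
        text = bytes.fromhex(token).decode("utf-8")
    except ValueError:  # also covers UnicodeDecodeError
        return None
    return text if text.isprintable() else None


def decode_prompt(prompt):
    """Replace recognizable encoded spans in the prompt with their decoded text."""
    text = unquote(prompt) if URL_RE.search(prompt) else prompt
    # Hex is tried first because hex digits are a subset of the base64 alphabet.
    text = HEX_RE.sub(lambda m: _try_hex(m.group()) or m.group(), text)
    text = BASE64_RE.sub(lambda m: _try_base64(m.group()) or m.group(), text)
    return text


def keyword_hits(prompt):
    """Return the suspicious keywords found in the decoded, lowercased prompt."""
    decoded = decode_prompt(prompt).lower()
    return [kw for kw in SUSPICIOUS_KEYWORDS if kw in decoded]
```

Decoding before matching is the point of the design: a keyword scan over the raw prompt would miss a phrase hidden inside an encoded span, whereas `keyword_hits` sees the plain text. Tokens that fail to decode to printable UTF-8 are left unchanged, which keeps ordinary long words from being mangled.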
Taxonomy
Methods: Umbrella Reinforcement Learning
