Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System

Sheikh Samit Muhaimin; Spyridon Mastorakis

arXiv:2505.01315 · cs.CL · March 10, 2026

TL;DR

This paper introduces a novel, retraining-free system for large language models that detects, filters, and summarizes malicious inputs to enhance their resistance against adversarial attacks, manipulative prompts, and encoded threats.

Contribution

The study presents an innovative defense framework combining NLP-based filtering and summarization modules that improve LLM security without retraining or fine-tuning.

Findings

01. 98.71% success rate in identifying harmful prompts
02. Enhanced resistance to jailbreak and malicious inputs
03. Maintains response quality while increasing security

Abstract

The rapid growth in the use of Large Language Models (LLMs) has exposed them to sophisticated adversarial attacks, manipulative prompts, and encoded malicious inputs. Existing countermeasures often require retraining the model, which is computationally costly and impractical for deployment. This study presents a defense framework that lets LLMs recognize, filter, and defend against adversarial or malicious inputs on their own, without retraining or fine-tuning. The proposed framework has two main parts: (1) a prompt filtering module that uses Natural Language Processing (NLP) techniques, including zero-shot classification, keyword analysis, and encoded content detection (e.g., base64, hexadecimal, URL encoding), to detect, decode, and classify harmful inputs; and (2) a summarization module that processes and summarizes…
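
To make the filtering idea concrete, the following is a minimal sketch of the encoded content detection and keyword analysis steps named in the abstract. It is not the authors' implementation: the keyword list, the length thresholds, and the function names (`decode_token`, `filter_prompt`) are hypothetical choices made for illustration, and the zero-shot classification stage is left out because it would need an external model.

```python
import base64
import binascii
import re
from typing import Optional, Tuple
from urllib.parse import unquote

# Hypothetical keyword list for illustration; not taken from the paper.
SUSPICIOUS_KEYWORDS = ("ignore previous instructions", "jailbreak", "bypass")

# Assumed minimum token length before attempting hex/base64 decoding.
MIN_ENCODED_LEN = 16


def _printable(data: bytes) -> Optional[str]:
    """Return the text if data is non-empty printable UTF-8, else None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return text if text and text.isprintable() else None


def decode_token(token: str) -> Optional[Tuple[str, str]]:
    """Try hex, then base64, then URL decoding on one whitespace-free token."""
    long_enough = len(token) >= MIN_ENCODED_LEN
    # Hex is tested first because hex digits are also valid base64 characters.
    if long_enough and len(token) % 2 == 0 and re.fullmatch(r"[0-9a-fA-F]+", token):
        text = _printable(bytes.fromhex(token))
        if text:
            return ("hex", text)
    if long_enough and len(token) % 4 == 0 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", token):
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            raw = b""
        text = _printable(raw)
        if text:
            return ("base64", text)
    if re.search(r"%[0-9a-fA-F]{2}", token):
        return ("url", unquote(token))
    return None


def filter_prompt(prompt: str) -> dict:
    """Decode suspicious tokens, then scan raw and decoded text for keywords."""
    decoded = []
    for token in prompt.split():
        hit = decode_token(token)
        if hit:
            decoded.append(hit)
    haystack = " ".join([prompt] + [text for _, text in decoded]).lower()
    matches = [kw for kw in SUSPICIOUS_KEYWORDS if kw in haystack]
    return {"decoded": decoded, "keywords": matches, "flagged": bool(matches)}
```

Calling `filter_prompt` on a prompt returns the decoded fragments, the matched keywords, and a `flagged` boolean. Scanning the decoded text alongside the raw prompt is the point of the design: a keyword hidden inside a base64 or hex payload would otherwise slip past a plain substring check. In a fuller pipeline, a classifier could then judge the decoded text.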

Taxonomy

Methods: Umbrella Reinforcement Learning