Privacy-Preserving Prompt Injection Detection for LLMs Using Federated Learning and Embedding-Based NLP Classification
Hasini Jayathilaka

TL;DR
This paper introduces a privacy-preserving framework for detecting prompt injection attacks on large language models using federated learning and embedding-based NLP classification, enabling effective detection without exposing raw data.
Contribution
It presents a novel federated learning approach for prompt injection detection that maintains user privacy while achieving performance comparable to centralized methods.
Findings
The federated approach preserves privacy by sharing only model parameters, never raw prompts.
Detection performance is comparable to that of centrally trained models.
The work serves as a proof of concept for privacy-aware LLM security.
Abstract
Prompt injection attacks are an emerging threat to large language models (LLMs), enabling malicious users to manipulate outputs through carefully designed inputs. Existing detection approaches often require centralizing prompt data, which creates significant privacy risks. This paper proposes a privacy-preserving prompt injection detection framework based on federated learning and embedding-based classification. A curated dataset of benign and adversarial prompts was encoded with sentence embeddings and used to train both centralized and federated logistic regression models. The federated approach preserved privacy by sharing only model parameters across clients, while achieving detection performance comparable to centralized training. Results demonstrate that effective prompt injection detection is feasible without exposing raw data, making this one of the first explorations of federated…
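To make the idea of sharing only model parameters concrete, the following is a minimal sketch of federated averaging over logistic regression, not the paper's actual implementation. Everything in it is an assumption for illustration: random vectors stand in for sentence embeddings, labels mark benign (0) versus injection (1) prompts, and the names `make_client`, `local_update`, `fed_avg`, and `accuracy`, as well as the learning rate, epoch, and round counts, are hypothetical. The abstract does not state which aggregation rule or hyperparameters were used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_client(rng, n, dim):
    """Synthetic stand-in for one client's embedded prompts (assumption).
    Label 1 = injection, 0 = benign; class means are shifted apart."""
    y = rng.integers(0, 2, size=n).astype(float)
    X = rng.normal(size=(n, dim)) + (2.0 * y - 1.0)[:, None]
    return X, y

def local_update(w, X, y, lr=0.5, epochs=5):
    """A few steps of gradient descent on the logistic loss, run on-device.
    Only the resulting weight vector leaves the client."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def fed_avg(clients, dim, rounds=20):
    """Server loop: broadcast weights, collect local updates, and average
    them weighted by each client's sample count. Raw data is never pooled."""
    w = np.zeros(dim)
    total = sum(len(y) for _, y in clients)
    for _ in range(rounds):
        updates = [local_update(w, X, y) for X, y in clients]
        w = sum((len(y) / total) * u for (_, y), u in zip(clients, updates))
    return w

def accuracy(w, X, y):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = (sigmoid(X @ w) >= 0.5).astype(float)
    return float(np.mean(preds == y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = [make_client(rng, 200, 8) for _ in range(3)]
    w = fed_avg(clients, dim=8)
    X_test, y_test = make_client(rng, 500, 8)
    print("held-out accuracy:", accuracy(w, X_test, y_test))
```

In a real deployment the `X` matrices would come from a sentence encoder run locally on each client, and the server would see only the weight vectors returned by `local_update`.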
Taxonomy
Topics: Spam and Phishing Detection · Topic Modeling · Authorship Attribution and Profiling
