NeuroFilter: Privacy Guardrails for Conversational LLM Agents
Saswat Das, Ferdinando Fioretto

TL;DR
NeuroFilter is a novel framework that detects privacy violations in conversational LLMs by analyzing internal model representations, effectively preventing attacks with minimal false positives and significantly lower computational costs.
Contribution
This paper introduces NeuroFilter, a new method that uses linear structure in internal representations to detect privacy violations in LLMs during conversations, improving efficiency and robustness.
Findings
High detection accuracy across diverse models and interactions.
Zero false positives on benign prompts.
Significant reduction in computational inference cost.
Abstract
This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the contextual integrity framework. Existing defenses rely on LLM-mediated checking stages that add substantial latency and cost, and that can be undermined in multi-turn interactions through manipulation or benign-looking conversational scaffolding. Against this background, the paper makes a key observation: internal representations associated with privacy-violating intent can be separated from those of benign requests using linear structure. Building on this insight, the paper proposes NeuroFilter, a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space, enabling detection even when semantic filters are bypassed. The proposed filter is also extended to capture…
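To make the idea of a "simple direction in activation space" concrete, here is a minimal sketch of a difference-of-means linear probe. It is an illustration only, not the paper's implementation: the activations are synthetic Gaussian vectors rather than real LLM hidden states, and every name in it (`fit_direction`, `flag`, `synthetic_activations`, the dimension and shift values) is a hypothetical choice made for this example.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_direction(benign, violating):
    """Difference-of-means probe.

    Returns a direction w pointing from the benign mean to the violating
    mean, and a threshold halfway between the two projected means.
    """
    mu_b = mean_vector(benign)
    mu_v = mean_vector(violating)
    w = [v - b for v, b in zip(mu_v, mu_b)]
    threshold = 0.5 * (dot(w, mu_b) + dot(w, mu_v))
    return w, threshold


def flag(activation, w, threshold):
    """Flag an activation whose projection onto w exceeds the threshold."""
    return dot(activation, w) > threshold


def synthetic_activations(n, dim, shift, rng):
    """ASSUMPTION: stand-in for hidden states. Unit Gaussian noise, with the
    first coordinate shifted to mimic a linearly separable 'intent' signal."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
DIM = 16  # hypothetical activation width, far smaller than a real model's

# Fit the probe on labeled synthetic data.
benign_train = synthetic_activations(200, DIM, 0.0, rng)
violating_train = synthetic_activations(200, DIM, 6.0, rng)
w, threshold = fit_direction(benign_train, violating_train)

# Evaluate on fresh synthetic samples.
benign_test = synthetic_activations(200, DIM, 0.0, rng)
violating_test = synthetic_activations(200, DIM, 6.0, rng)
false_positive_rate = sum(flag(x, w, threshold) for x in benign_test) / len(benign_test)
detection_rate = sum(flag(x, w, threshold) for x in violating_test) / len(violating_test)

print(f"false positive rate: {false_positive_rate:.3f}")
print(f"detection rate:      {detection_rate:.3f}")
```

The check at inference time is a single dot product per activation vector, which is one plausible reading of why a probe of this kind would cost far less than an additional LLM-mediated checking stage. How NeuroFilter actually selects layers, fits its directions, or handles multi-turn conversations is not specified in the text above.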
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Topic Modeling
