NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Saswat Das; Ferdinando Fioretto

arXiv:2601.14660 · cs.CR · January 22, 2026

Open Access

TL;DR

NeuroFilter is a guardrail framework that detects privacy violations in conversational LLM agents by analyzing the model's internal representations, blocking attacks with minimal false positives and at significantly lower computational cost.

Contribution

This paper introduces NeuroFilter, a method that uses linear structure in a model's internal representations to detect privacy violations during LLM conversations, improving efficiency and robustness over LLM-mediated checking stages.

Findings

01. High detection accuracy across diverse models and interactions.

02. Zero false positives on benign prompts.

03. Significant reduction in computational inference cost.

Abstract

This work addresses the computational challenge of enforcing privacy for agentic Large Language Models (LLMs), where privacy is governed by the contextual integrity framework. Existing defenses rely on LLM-mediated checking stages that add substantial latency and cost, and that can be undermined in multi-turn interactions through manipulation or benign-looking conversational scaffolding. Against this background, this paper makes a key observation: internal representations associated with privacy-violating intent can be separated from benign requests using linear structure. Using this insight, the paper proposes NeuroFilter, a guardrail framework that operationalizes contextual integrity by mapping norm violations to simple directions in the model's activation space, enabling detection even when semantic filters are bypassed. The proposed filter is also extended to capture…
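
As a rough illustration of the linear-separation idea described in the abstract, the sketch below fits a single direction on synthetic vectors that stand in for hidden-state activations and flags inputs whose projection exceeds a threshold. It is not the authors' implementation: the difference-of-means probe, the midpoint threshold, the function names `fit_direction` and `flag`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def fit_direction(benign, violating):
    """Fit a difference-of-means probe (an assumed, simple choice of linear probe).

    Returns a unit direction `w` and a threshold placed halfway between the
    projected class means.
    """
    mu_benign = benign.mean(axis=0)
    mu_violating = violating.mean(axis=0)
    w = mu_violating - mu_benign
    w = w / np.linalg.norm(w)
    threshold = 0.5 * (mu_benign @ w + mu_violating @ w)
    return w, threshold


def flag(activation, w, threshold):
    """Return True when the activation projects past the threshold along w."""
    return bool(activation @ w > threshold)


# Synthetic stand-ins for activations (hypothetical data, not model outputs):
# the "violating" class is shifted along one coordinate of a 64-dim space.
rng = np.random.default_rng(0)
dim = 64
shift = np.zeros(dim)
shift[0] = 8.0
benign = rng.normal(size=(200, dim))
violating = rng.normal(size=(200, dim)) + shift

w, threshold = fit_direction(benign, violating)

# Fraction of each class flagged by the probe.
benign_flag_rate = np.mean([flag(x, w, threshold) for x in benign])
violating_flag_rate = np.mean([flag(x, w, threshold) for x in violating])
print(f"benign flagged: {benign_flag_rate:.3f}, violating flagged: {violating_flag_rate:.3f}")
```

The check at inference time is a single dot product per activation, which is the kind of cost profile that makes a representation-level filter cheaper than an additional LLM call. In a real setting the vectors would come from a chosen layer of the model rather than a random generator.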

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Topic Modeling