CAPID: Context-Aware PII Detection for Question-Answering Systems

Mariia Ponomarenko; Sepideh Abedini; Masoumeh Shafieinejad; D.B.Emerson; Shubhankar Mohapatra; Xi He

arXiv:2602.10074 · cs.CR · February 11, 2026

TL;DR

This paper introduces CAPID, a privacy-preserving system that fine-tunes small language models to detect and classify contextually relevant PII in user queries, improving privacy and response quality in question-answering systems.

Contribution

It presents a novel synthetic data generation pipeline and a fine-tuning approach for small language models to detect and classify relevant PII in context.

Findings

01 Relevance-aware PII detection outperforms existing baselines.

02 The approach preserves higher downstream utility.

03 The synthetic dataset covers diverse PII types and relevance levels.

Abstract

Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering (QA) systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but because they are closed-source and lack privacy guarantees, they are unsuitable for processing sensitive data. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic…
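The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PiiSpan` type, the `redact_irrelevant` and `find` helpers, the placeholder format, and the example labels are all hypothetical, and the spans are written by hand here where CAPID would obtain them from the fine-tuned SLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiSpan:
    """Hypothetical detector output: one PII span plus a relevance flag."""
    start: int      # inclusive character offset in the query
    end: int        # exclusive character offset in the query
    pii_type: str   # illustrative label, e.g. "NAME" or "HEALTH"
    relevant: bool  # True if the span is needed to answer the question


def redact_irrelevant(query: str, spans: list[PiiSpan]) -> str:
    """Replace spans marked irrelevant with a type placeholder; keep the rest."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(query[cursor:span.start])
        if span.relevant:
            pieces.append(query[span.start:span.end])  # keep relevant PII verbatim
        else:
            pieces.append(f"[{span.pii_type}]")        # mask irrelevant PII
        cursor = span.end
    pieces.append(query[cursor:])
    return "".join(pieces)


def find(query: str, text: str, pii_type: str, relevant: bool) -> PiiSpan:
    """Helper for the example: build a span by locating `text` in `query`."""
    start = query.index(text)
    return PiiSpan(start, start + len(text), pii_type, relevant)


query = "I'm Alice Smith and I have type 2 diabetes. What should I eat for breakfast?"
# Hand-written stand-in for the detector's output (assumed, not from the paper).
spans = [
    find(query, "Alice Smith", "NAME", relevant=False),
    find(query, "type 2 diabetes", "HEALTH", relevant=True),
]
sanitized = redact_irrelevant(query, spans)
print(sanitized)
# I'm [NAME] and I have type 2 diabetes. What should I eat for breakfast?
```

In this example the name is masked because it does not affect the answer, while the health condition is kept because the dietary question depends on it. Only the sanitized string would be forwarded to the downstream LLM.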

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Expert finding and Q&A systems