CAPID: Context-Aware PII Detection for Question-Answering Systems
Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D. B. Emerson, Shubhankar Mohapatra, Xi He

TL;DR
This paper introduces CAPID, a privacy-preserving system that fine-tunes small language models to detect and classify contextually relevant PII in user queries, improving privacy and response quality in question-answering systems.
Contribution
It presents a novel synthetic data generation pipeline and a fine-tuning approach for small language models to detect and classify relevant PII in context.
Findings
Relevance-aware PII detection outperforms existing baselines.
The approach preserves higher downstream utility.
The synthetic dataset covers diverse PII types and relevance levels.
Abstract
Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mostly redact all PII, even though some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII is relevant, but their closed-source nature and lack of privacy guarantees make them unsuitable for processing sensitive data. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic…
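To make the filtering step in the abstract concrete, here is a minimal Python sketch of relevance-aware redaction. It is not the authors' implementation: the `PIISpan` structure, the `span_for` and `redact_irrelevant` helpers, the placeholder format, and the example labels are all hypothetical, and the detected spans are hard-coded where a fine-tuned SLM would normally supply them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PIISpan:
    """A detected PII span (hypothetical schema, not taken from the paper)."""
    start: int      # inclusive character offset in the query
    end: int        # exclusive character offset in the query
    pii_type: str   # e.g. "NAME", "HEALTH"
    relevant: bool  # True if the span is needed to answer the question


def span_for(query: str, text: str, pii_type: str, relevant: bool) -> PIISpan:
    """Build a span by locating the first occurrence of `text` in `query`."""
    start = query.index(text)
    return PIISpan(start, start + len(text), pii_type, relevant)


def redact_irrelevant(query: str, spans: list[PIISpan]) -> str:
    """Replace irrelevant PII with a type placeholder; keep relevant PII."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(query[cursor:span.start])
        if span.relevant:
            pieces.append(query[span.start:span.end])
        else:
            pieces.append(f"[{span.pii_type}]")
        cursor = span.end
    pieces.append(query[cursor:])
    return "".join(pieces)


query = "I'm Alice Chen and I have diabetes. What snacks are safe for me?"

# Assumed detector output: the name is not needed to answer the question,
# while the health condition is.
spans = [
    span_for(query, "Alice Chen", "NAME", relevant=False),
    span_for(query, "diabetes", "HEALTH", relevant=True),
]

redacted = redact_irrelevant(query, spans)
print(redacted)
```

In this sketch only the redacted string would be forwarded to the downstream LLM. Redacting every span instead would also remove "diabetes", which is the loss of response quality the abstract attributes to redact-all approaches.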
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Expert finding and Q&A systems
