Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
Thomas Cory, Axel Küpper

TL;DR
This paper explores using Large Language Models to produce taxonomy-agnostic annotations of PII values in HTTP traffic, with the label taxonomy supplied at runtime, to reduce dependence on fixed label sets and scarce manually labelled traffic.
Contribution
It introduces a multi-stage LLM pipeline for taxonomy-agnostic PII annotation and a synthetic data generator for evaluation without sensitive data.
Findings
The pipeline accurately detects PII types across different taxonomies.
LLMs can extract PII values effectively in a taxonomy-agnostic manner.
Synthetic HTTP traffic with validated annotations supports evaluation without real user data.
Abstract
Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an…
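The four stages named in the abstract (deterministic pre-processing, label-level classification, instance-level value annotation, output validation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt layout, and the validation rule (keep only values that occur verbatim in the message body) are all assumptions made for illustration, and `toy_llm` is a keyword-matching stand-in used so the example runs without a model.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import parse_qsl

# Hypothetical interface: a callable that maps a prompt to the model's reply.
LLM = Callable[[str], str]


def preprocess(body: str) -> Dict[str, str]:
    """Stage 1 (deterministic): flatten a JSON or form-encoded body to path -> value."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return dict(parse_qsl(body))
    flat: Dict[str, str] = {}

    def walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")
        else:
            flat[path] = str(node)

    walk(data, "")
    return flat


def build_prompt(task: str, label: str, description: str, listing: str) -> str:
    # Assumed prompt layout; the paper's actual prompts are not shown in the source.
    return f"TASK: {task}\nLABEL: {label}\nDESCRIPTION: {description}\nFIELDS:\n{listing}"


def annotate(body: str, taxonomy: Dict[str, str], llm: LLM) -> List[dict]:
    """Run the four stages; `taxonomy` (label -> description) is supplied at runtime."""
    fields = preprocess(body)
    listing = "\n".join(f"{k}={v}" for k, v in fields.items())

    # Stage 2: label-level classification, one yes/no question per label.
    present = [
        label for label, desc in taxonomy.items()
        if llm(build_prompt("classify", label, desc, listing)).strip().lower().startswith("yes")
    ]

    annotations: List[dict] = []
    for label in present:
        # Stage 3: instance-level annotation, only for labels judged present.
        reply = llm(build_prompt("extract", label, taxonomy[label], listing))
        try:
            candidates = json.loads(reply)
        except json.JSONDecodeError:
            continue  # Stage 4: discard replies that are not valid JSON.
        if not isinstance(candidates, list):
            continue
        for value in candidates:
            # Stage 4: keep only values that really occur in the message body.
            if isinstance(value, str) and value in fields.values():
                annotations.append({"label": label, "value": value})
    return annotations


def toy_llm(prompt: str) -> str:
    """Offline stand-in for a model: matches the label name against field paths."""
    lines = prompt.splitlines()
    task = lines[0].split(": ")[1]
    label = lines[1].split(": ")[1]
    fields = dict(l.split("=", 1) for l in lines[lines.index("FIELDS:") + 1:])
    hits = [v for k, v in fields.items() if label in k.lower()]
    if task == "classify":
        return "yes" if hits else "no"
    # Deliberately append a value that is not in the body to exercise validation.
    return json.dumps(hits + ["hallucinated@example.org"])


body = '{"user": {"email": "ada@example.com", "age": 36}, "theme": "dark"}'
taxonomy = {"email": "An email address", "phone": "A telephone number"}
print(annotate(body, taxonomy, toy_llm))
```

Because the taxonomy is an ordinary argument, swapping in a different label set requires no retraining, which is the sense of "taxonomy-agnostic" used here. The validation step in this sketch drops the fabricated address returned by the stand-in model, since it never appears in the request body.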
