Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Zhuqian Zhou; Kirk Vanacore; Bakhtawar Ahtisham; Jinsook Lee; Doug Pietrzak; Daryl Hedley; Jorge Dias; Chris Shaw; Ruth Schäfer; René F. Kizilcec

arXiv:2602.16571·cs.CL·May 11, 2026

TL;DR

This paper introduces MathEd-PII, a benchmark dataset for PII detection in math tutoring dialogues, and demonstrates that domain-aware, segment-aware prompting significantly improves de-identification accuracy while maintaining educational utility.

Contribution

The work presents the first benchmark dataset for PII detection in math tutoring data and shows that domain-aware prompting enhances de-identification performance.

Findings

01

Domain-aware prompting achieves F1 scores of 0.802 and 0.821, outperforming the baseline.

02

False PII redactions cluster in math-dense regions, indicating numeric ambiguity.

03

Domain context is essential for utility-preserving de-identification.

Abstract

Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and…
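
The numeric ambiguity described in the abstract can be illustrated with a small sketch. It is not the authors' pipeline and does not use Presidio; the `DATE_LIKE` and `MATH_CUE` patterns, the helper names, and the sample transcript are all hypothetical, chosen only to show how a generic date-like pattern can fire on fractions and how a simple math density over a segment might be computed.

```python
import re

# Hypothetical generic pattern for date-like strings such as 10/12 or 12/25/2024.
DATE_LIKE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")

# Hypothetical cue for math content: an operator symbol or an operation word.
MATH_CUE = re.compile(r"[+*=^]|\bplus\b|\btimes\b|\bequals\b")


def flag_date_like(utterance):
    """Return every substring that the generic date-like pattern would flag."""
    return DATE_LIKE.findall(utterance)


def math_density(utterances):
    """Fraction of utterances in a segment that contain at least one math cue."""
    if not utterances:
        return 0.0
    hits = sum(1 for u in utterances if MATH_CUE.search(u))
    return hits / len(utterances)


# Invented sample transcript: two math turns and one scheduling turn.
transcript = [
    "So what is 3/4 plus 1/8?",
    "I think it is 7/8.",
    "Great. Our next session is on 10/12.",
]

for turn in transcript:
    # Fractions and the real date are flagged alike by the generic pattern.
    print(turn, "->", flag_date_like(turn))
```

In this toy example the pattern cannot tell the fractions `3/4` and `1/8` from the date `10/12`, so redacting every match would remove the instructional content of the first two turns. A density score like the one above is one possible signal for deciding when a numeric match sits in a math-heavy stretch of dialogue and deserves closer scrutiny before redaction.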

Code & Models

Datasets

NationalTutoringObservatory/MathEd-PII
dataset · 71 dl