On Differentially Private String Distances
Jerry Yao-Chieh Hu, Erzhi Liu, Han Liu, Zhao Song, Lichen Zhang

TL;DR
This paper introduces differentially private data structures for efficiently estimating Hamming and edit distances between strings, ensuring privacy and accuracy even with multiple queries.
Contribution
It presents novel DP data structures for string distance estimation that are both time- and space-efficient, with improved accuracy bounds for Hamming and edit distances.
Findings
Answer queries in near-linear time for Hamming distance
Achieve small deviation from true distances with privacy guarantees
Support sublinear query operations for moderate distance k
Abstract
Given a database of bit strings $A_1, \ldots, A_m$, a fundamental data structure task is to estimate the distances between a given query $B$ and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for this type of task, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is $\epsilon$-DP against any sequence of queries of arbitrary length, and for any query $B$ such that the maximum distance to any string in the database is at most $k$, we output $m$ distance estimates. Moreover, - For Hamming distance, our data structure answers any query in $\widetilde…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The paper addresses a well-defined mathematical problem and appears to involve nontrivial analysis of perturbed versions of string sketches (e.g. LSH outputs).
* Motivation: The paper does not clearly list plausible settings where this notion of privacy (and accuracy) makes sense. In the DNA application, for example, the privacy protection seems very weak.
* Significance: The algorithm provides nontrivial accuracy guarantees only for very large values of $\epsilon$. I couldn't understand a setting where this guarantee would be useful/important. (The paper mentions that $\epsilon = k \log k$ is "too large for most applications". Why is $\log k$ ok? Both
They study one of the most fundamental data structure problems, and the error metric is fine. Their results are clearly stated, which makes it easy to understand the merit of the paper. I really want to thank the authors for that.
The error on approximating Hamming distance between each query and database string scales as $k \log k$ (for typical small $\varepsilon$). I am confused as to why the algorithm's accuracy is in any way meaningful. The algorithmic novelty for the Hamming distance data structure is unclear. It is described as an adaptation of a non-private approach, followed by the use of randomized response.
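For readers unfamiliar with the randomized-response step this review refers to, the following is a minimal generic sketch, not the paper's data structure. It flips each bit of a stored string independently and then debiases the observed mismatch count against a query. The function names, the per-bit flip probability $1/(1+e^{\epsilon})$, and the assumption that privacy is measured per bit are all illustrative choices rather than details taken from the paper.

```python
import math
import random


def flip_probability(eps):
    """Per-bit flip probability for standard binary randomized response."""
    return 1.0 / (1.0 + math.exp(eps))


def randomized_response(bits, eps, rng):
    """Return a noisy copy of `bits`, flipping each bit independently."""
    p = flip_probability(eps)
    return [b ^ int(rng.random() < p) for b in bits]


def estimate_hamming(noisy_bits, query_bits, eps):
    """Debiased Hamming estimate between the original string and the query.

    If d is the true distance and n the length, the expected number of
    observed mismatches is n*p + d*(1 - 2p), so we invert that relation.
    Requires eps > 0 so that 1 - 2p is nonzero.
    """
    n = len(noisy_bits)
    p = flip_probability(eps)
    observed = sum(a != b for a, b in zip(noisy_bits, query_bits))
    return (observed - n * p) / (1.0 - 2.0 * p)


if __name__ == "__main__":
    rng = random.Random(0)
    n, true_distance = 20000, 500
    stored = [rng.randint(0, 1) for _ in range(n)]
    # Hypothetical query: differs from the stored string in the first 500 bits.
    query = [b ^ 1 if i < true_distance else b for i, b in enumerate(stored)]
    noisy = randomized_response(stored, eps=1.0, rng=rng)
    print("true distance:", true_distance)
    print("estimate:", round(estimate_hamming(noisy, query, eps=1.0), 1))
```

In this naive form the noise in the estimate grows with the string length $n$ rather than with $k$, which is one reason a sketch-based approach would be needed to reach bounds that depend only on $k$.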
The paper is generally well written and easy to follow. A notable feature of the proposed data structure is that it supports an unbounded (potentially infinite) number of queries while maintaining differential privacy.
The assumptions and resulting bounds in the paper do not appear to be meaningful (please correct me if I am mistaken). Let $A_1, \ldots, A_m$ denote the dataset strings and $B$ the query string. 1. The assumption that $D(A_i, B) \le k$ for all $i \in [m]$ implies that $D(A_i, A_j) \le 2k$ for all pairs $(i, j)$. Consequently, all dataset strings must be highly similar to each other if $k$ is small. 2. On page 6, the reported Hamming distance error bound is $$ \frac{k \l
String distance is a natural problem, and I am not aware of past work on doing it with DP. The approach for Hamming distance is intuitive.
1) For Hamming distance, unless $\varepsilon$ is very large, the error bound $k/e^{\varepsilon/\log(k)}$ is close to the (trivial) error bound $k$ baked into the theorem assumption. The same problem holds (to a greater degree) for edit distance. This might be OK with even partial lower bounds, but no lower bounds are provided. Since $\varepsilon$-DP is a fairly meaningless privacy guarantee unless $\varepsilon$ is a small constant (say, $\varepsilon \ll 10$), these are very weak utility results
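The reviewer's point about the bound $k/e^{\varepsilon/\log(k)}$ can be checked numerically. The sketch below simply evaluates that expression for a few values of $\varepsilon$. It assumes the natural logarithm and ignores any constants or logarithmic factors hidden in the paper's asymptotic notation, so the numbers are only indicative; the function name and the choice $k = 1000$ are illustrative.

```python
import math


def hamming_error_bound(k, eps):
    """Evaluate k / e^(eps / ln k), dropping hidden constants and log factors.

    Assumes k > 1 so that ln k is positive.
    """
    return k / math.exp(eps / math.log(k))


if __name__ == "__main__":
    k = 1000  # illustrative distance threshold
    for eps in (0.1, 1.0, 10.0, 50.0):
        bound = hamming_error_bound(k, eps)
        print(f"eps={eps:>5}: bound ~ {bound:8.1f}  ({bound / k:.1%} of k)")
```

Under these assumptions, the bound stays above 80% of $k$ at $\varepsilon = 1$ and falls well below $k$ only once $\varepsilon$ is several multiples of $\log k$.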
Taxonomy
Topics: Limits and Structures in Graph Theory · Cryptography and Data Security · Complexity and Algorithms in Graphs
