On Differentially Private String Distances

Jerry Yao-Chieh Hu; Erzhi Liu; Han Liu; Zhao Song; Lichen Zhang

arXiv:2411.05750·cs.DS·November 11, 2024


Open Access · 4 Reviews

TL;DR

This paper introduces differentially private data structures for efficiently estimating Hamming and edit distances between strings, ensuring privacy and accuracy even with multiple queries.

Contribution

It presents novel DP data structures for string distance estimation that are both time- and space-efficient, with improved accuracy bounds for Hamming and edit distances.

Findings

01

Answer queries in near-linear time for Hamming distance

02

Achieve small deviation from true distances with privacy guarantees

03

Support sublinear query operations for moderate distance k

Abstract

Given a database of bit strings $A_{1}, \dots, A_{m} \in \{0, 1\}^{n}$, a fundamental data structure task is to estimate the distances between a given query $B \in \{0, 1\}^{n}$ and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for this type of task, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is $\epsilon$-DP against any sequence of queries of arbitrary length, and for any query $B$ such that the maximum distance to any string in the database is at most $k$, we output $m$ distance estimates. Moreover, - For Hamming distance, our data structure answers any query in $\widetilde…
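As a point of reference for the task described in the abstract, the non-private baseline can be sketched as follows. This is a minimal illustration, not the paper's data structure: the function names `hamming_distance` and `all_distances` are hypothetical, strings are represented as Python `str` objects of `'0'`/`'1'` characters, and no noise is added, so it offers no privacy guarantee.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def all_distances(database, query):
    """Exact distance from the query to every database string (O(m * n) time)."""
    return [hamming_distance(a, query) for a in database]


if __name__ == "__main__":
    database = ["0000", "0101", "1111"]
    print(all_distances(database, "0100"))
```

The baseline scans every string in full on each query, which is the cost that a time-efficient data structure of the kind the abstract describes would aim to reduce.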

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper addresses a well-defined mathematical problem and appears to involve nontrivial analysis of perturbed versions of string sketches (e.g. LSH outputs).

Weaknesses

* Motivation: The paper does not clearly list plausible settings where this notion of privacy (and accuracy) makes sense. In the DNA application, for example, the privacy protection seems very weak.
* Significance: The algorithm provides nontrivial accuracy guarantees only for very large values of $\epsilon$. I couldn't understand a setting where this guarantee would be useful/important. (The paper mentions that $\epsilon = k \log k$ is "too large for most applications". Why is $\log k$ ok? Both

Reviewer 02 · Rating 4 · Confidence 3

Strengths

They study one of the most fundamental data structure problems, and the error metric is fine. Their results are clearly stated, which makes it easy to understand the merit of the paper. I really want to thank the authors for that.

Weaknesses

The error in approximating the Hamming distance between each query and database string scales as $k \log k$ (for typical small $\varepsilon$). I am confused as to why the algorithm's accuracy is in any way meaningful. The algorithmic novelty of the Hamming distance data structure is unclear. It is described as an adaptation of a non-private approach, followed by the use of randomized response.
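For readers unfamiliar with the randomized response technique the reviewer mentions, a generic sketch is given here. It is not the paper's construction: it simply flips each bit of a stored string independently with probability $p = 1/(1 + e^{\varepsilon})$, which is $\varepsilon$-DP with respect to a change in a single bit, and then debiases the observed distance. All function names are hypothetical, and the notion of neighboring databases assumed here (one bit changed) may differ from the paper's.

```python
import math
import random


def flip_probability(epsilon):
    """Per-bit flip probability 1 / (1 + e^epsilon) for randomized response."""
    return 1.0 / (1.0 + math.exp(epsilon))


def randomize_bits(bits, epsilon, rng):
    """Flip each bit independently with probability flip_probability(epsilon)."""
    p = flip_probability(epsilon)
    return [b ^ 1 if rng.random() < p else b for b in bits]


def debiased_hamming(noisy_bits, query_bits, epsilon):
    """Unbiased estimate of the distance between the original string and the query.

    With true distance d and length n, the observed distance has expectation
    d * (1 - 2p) + n * p; this function inverts that relation.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive so that 1 - 2p is nonzero")
    p = flip_probability(epsilon)
    n = len(query_bits)
    observed = sum(a != b for a, b in zip(noisy_bits, query_bits))
    return (observed - n * p) / (1.0 - 2.0 * p)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    stored = [rng.randint(0, 1) for _ in range(n)]
    query = list(stored)
    for i in range(50):  # make the true distance exactly 50
        query[i] ^= 1
    noisy = randomize_bits(stored, 2.0, rng)
    print(debiased_hamming(noisy, query, 2.0))
```

Because the noise is applied to all $n$ bits, the standard deviation of this naive estimate grows with $\sqrt{n}$ rather than with the distance bound $k$, which is one reason a direct application of randomized response is not enough on its own.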

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper is generally well written and easy to follow. A notable feature of the proposed data structure is that it supports an unbounded (potentially infinite) number of queries while maintaining differential privacy.

Weaknesses

The assumptions and resulting bounds in the paper do not appear to be meaningful (please correct me if I am mistaken). Let $A_1, \ldots, A_m$ denote the dataset strings and $B$ the query string.

1. The assumption that $D(A_i, B) \le k$ for all $i \in [m]$ implies that $D(A_i, A_j) \le 2k$ for all pairs $(i, j)$. Consequently, all dataset strings must be highly similar to each other if $k$ is small.
2. On page 6, the reported Hamming distance error bound is $$ \frac{k \l

Reviewer 04 · Rating 2 · Confidence 4

Strengths

String distance is a natural problem, and I am not aware of past work on doing it with DP. The approach for Hamming distance is intuitive.

Weaknesses

1) For Hamming distance, unless $\varepsilon$ is very large, the error bound $k/e^{\varepsilon/\log(k)}$ is close to the (trivial) error bound $k$ baked into the theorem assumption. The same problem holds (to a greater degree) for edit distance. This might be OK with even partial lower bounds, but no lower bounds are provided. Since $\varepsilon$-DP is a fairly meaningless privacy guarantee unless $\varepsilon$ is a small constant (say, $\varepsilon \ll 10$), these are very weak utility results
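The reviewer's comparison can be checked numerically with a short sketch. It evaluates the expression $k/e^{\varepsilon/\log(k)}$ exactly as written in the review, under two assumptions that are not confirmed by the source: the logarithm is taken to be natural, and any constant factors in the paper's bound are ignored. The function name `reviewer_error_bound` is hypothetical.

```python
import math


def reviewer_error_bound(k, epsilon):
    """Evaluate k / exp(epsilon / ln k), ignoring any hidden constants."""
    if k <= 1:
        raise ValueError("k must exceed 1 so that ln(k) is positive")
    return k / math.exp(epsilon / math.log(k))


if __name__ == "__main__":
    k = 1000
    for epsilon in (0.1, 1.0, 10.0, 50.0):
        bound = reviewer_error_bound(k, epsilon)
        print(f"epsilon={epsilon:>5}: bound ~ {bound:8.1f} (trivial bound {k})")
```

Under these assumptions, with $k = 1000$ and $\varepsilon = 1$ the expression stays above 850, that is, within 15% of the trivial bound $k$, which is consistent with the reviewer's concern.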


Taxonomy

Topics: Limits and Structures in Graph Theory · Cryptography and Data Security · Complexity and Algorithms in Graphs
