Combinatorial Algorithms for String Sanitization
Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone, Michelle Sweering

TL;DR
This paper introduces algorithms for sanitizing strings to hide sensitive patterns while preserving data utility, balancing minimal length, pattern integrity, and minimal edits, with applications in location and DNA data sharing.
Contribution
It presents time-optimal algorithms for string sanitization that conceal sensitive patterns while maintaining non-sensitive pattern properties, including a heuristic for enhanced security.
Findings
Algorithms achieve minimal-length sanitized strings.
The heuristic prevents pattern reinstatement.
Methods preserve pattern frequencies and order.
| Dataset | Data domain | Length | Alphabet size | # sensitive patterns | # sensitive positions | Pattern length | Implausible pat. threshold |
|---|---|---|---|---|---|---|---|
| OLD | Movement | 85,563 | 100 | | | | |
| TRU | Transportation | 5,763 | 100 | | | | |
| MSN | Web | 4,698,764 | 17 | | | | |
| DNA | Genomic | 4,641,652 | 4 | | | | |
| SYN | Synthetic | 20,000,000 | 10 | - | | | |
| SYN | Synthetic | 1,000 | 2 | - | | | |
Giulia Bernardini (1), Huiping Chen (2), Alessio Conte (3), Roberto Grossi (3,4), Grigorios Loukides (2), Nadia Pisanti (3,4), Solon P. Pissis (4,5), Giovanna Rosone (3), Michelle Sweering (5)
(1) Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy. Email: [email protected]
(2) Department of Informatics, King’s College London, London, UK. Email: [huiping.chen,grigorios.loukides]@kcl.ac.uk
(3) Department of Computer Science, University of Pisa, Pisa, Italy. Email: [conte,grossi,pisanti]@di.unipi.it, [email protected]
(4) ERABLE Team, INRIA, Lyon, France
(5) CWI, Amsterdam, The Netherlands. Email: [solon.pissis,michelle.sweering]@cwi.nl
Abstract
String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks.
In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by “reversing” the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching.
We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO and then MCSR-ALGO and experimentally show that it is effective and efficient. We also show that TFS-ALGO is nearly as effective at minimizing the edit distance as ETFS-ALGO, while being substantially more efficient than ETFS-ALGO.
1 Introduction
A large number of applications, in domains ranging from transportation to web analytics and bioinformatics feature data modeled as strings, *i.e., *sequences of letters over some finite alphabet. For instance, a string may represent the history of visited locations of one or more individuals, with each letter corresponding to a location. Similarly, it may represent the history of search query terms of one or more web users, with letters corresponding to query terms, or a medically important part of the DNA sequence of a patient, with letters corresponding to DNA bases. Analyzing such strings is key in applications including location-based service provision, product recommendation, and DNA sequence analysis. Therefore, such strings are often disseminated beyond the party that has collected them. For example, location-based service providers often outsource their data to data analytics companies who perform tasks such as similarity evaluation between strings [20], and retailers outsource their data to marketing agencies who perform tasks such as mining frequent patterns from the strings [21].
However, disseminating a string intact may result in the exposure of confidential knowledge, such as trips to mental health clinics in transportation data [34], query terms revealing political beliefs or sexual orientation of individuals in web data [27], or diseases associated with certain parts of DNA data [24]. Thus, it may be necessary to sanitize a string prior to its dissemination, so that confidential knowledge is not exposed. At the same time, it is important to preserve the utility of the sanitized string, so that data protection does not outweigh the benefits of disseminating the string to the party that disseminates or analyzes the string, or to the society at large. For example, a retailer should still be able to obtain actionable knowledge in the form of frequent patterns from the marketing agency who analyzed their outsourced data; and researchers should still be able to perform analyses such as identifying significant patterns in DNA sequences.
1.1 Our Model and Settings
Motivated by the discussion above, we introduce the following model which we call Combinatorial String Dissemination (CSD). In CSD, a party has a string $W$ that it seeks to disseminate, while satisfying a set of constraints and a set of desirable properties. For instance, the constraints aim to capture privacy requirements and the properties aim to capture data utility considerations (e.g., posed by some other party based on applications). To satisfy both, $W$ must be transformed to a string $X$ by applying a sequence of edit operations. The computational task is to determine this sequence of edit operations so that the transformed string satisfies the desirable properties subject to the constraints. Clearly, the constraints and the properties must be specified based on the application.
Under the CSD model, we consider two specific settings addressing practical considerations in common string processing applications: the Minimal String Length (MSL) setting, in which the goal is to produce a shortest string that satisfies the set of constraints and the set of desirable properties, and the Minimal Edit Distance (MED) setting, in which the goal is to produce a string that satisfies the set of constraints and the set of desirable properties and is at minimal edit distance from $W$. In the following, we discuss each setting in more detail.
MSL Setting
In this setting, the sanitized string $X$ must satisfy the following constraint C1: for an integer $k > 0$, no given length-$k$ substring (also called pattern) modeling confidential knowledge should occur in $X$. We call each such length-$k$ substring a sensitive pattern. We aim at finding the shortest possible string $X$ satisfying the following desired properties: (P1) the order of appearance of all other length-$k$ substrings (non-sensitive patterns) is the same in the input string $W$ and in $X$; and (P2) the frequency of these length-$k$ substrings is the same in $W$ and in $X$. The problem of constructing $X$ in this setting is referred to as TFS (Total order, Frequency, Sanitization). Note that it is straightforward to hide substrings of arbitrary lengths from $X$, by setting $k$ equal to the length of the shortest substring we wish to hide, and then setting, for each of these substrings, any length-$k$ substring as sensitive.
The MSL setting is motivated by real-world applications involving string dissemination. In these applications, a data custodian disseminates the sanitized version $X$ of a string $W$ to a data recipient, for the purpose of analysis (e.g., mining). $W$ contains confidential information that the data custodian needs to hide, so that it does not occur in $X$. Such information is specified by the data custodian based on domain expertise, as in [1, 6, 16, 21]. At the same time, the data recipient specifies P1 and P2 that $X$ must satisfy in order to be useful. These properties map directly to common data utility considerations in string analysis. By satisfying P1, $X$ allows tasks based on the sequential nature of the string, such as blockwise $q$-gram distance computation [17], to be performed accurately. By satisfying P2, $X$ allows computing the frequency of length-$k$ substrings and hence mining frequent length-$k$ substrings [29] with no utility loss. We require that $X$ has minimal length so that it does not contain redundant information. For instance, the string which is constructed by concatenating all non-sensitive length-$k$ substrings in $W$ and separating them with a special letter that does not occur in $W$, satisfies P1 and P2 but is not the shortest possible. Such a string will have a negative impact on the efficiency of any subsequent analysis tasks to be performed on it.
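The non-minimal baseline mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `naive_sanitize`, the separator `#`, and the example input are hypothetical choices and are not taken from the paper.

```python
def naive_sanitize(w: str, k: int, sensitive: set, sep: str = "#") -> str:
    """Concatenate every non-sensitive length-k substring of w, in order of
    appearance, separated by a letter that does not occur in w."""
    assert sep not in w, "the separator must not occur in the input string"
    kept = [w[i:i + k] for i in range(len(w) - k + 1)
            if w[i:i + k] not in sensitive]
    return sep.join(kept)


# Illustrative input (an assumption, chosen only for demonstration).
w = "aabaaaababbbaab"
sensitive = {"aaaa", "baaa", "bbaa"}
baseline = naive_sanitize(w, 4, sensitive)
print(baseline, len(baseline))
```

Every length-4 window of the result that avoids the separator is one of the kept patterns, so order and frequency are preserved, but the output is much longer than the input, which is the inefficiency the text points out.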
MED Setting
In this setting, the sanitized version of string $W$ must satisfy the properties P1 and P2, subject to the constraint C1, and also be at minimal edit distance from string $W$. Constructing such a string allows many tasks that are based on edit distance to be performed accurately. Examples of such tasks are frequent pattern mining [31], clustering [19], entity extraction [37] and range query answering [23], which are important in domains such as bioinformatics [31], text mining [37], and speech recognition [13].
Note that existing works for sequential data sanitization (e.g., [6, 16, 18, 21, 36]) or anonymization (e.g., [3, 7, 10]) cannot be applied to our settings (see Section 8 for details).
1.2 Our Contributions
We define the TFS problem for string sanitization and a variant of it, referred to as PFS (Partial order, Frequency, Sanitization), which aims at producing an even shorter string $Y$ by relaxing P1 of TFS. We also develop algorithms for TFS and PFS. Our algorithms construct strings $X$ and $Y$ using a separator letter #, which is not contained in the alphabet of $W$, ensuring that sensitive patterns do not occur in $X$ or $Y$. The algorithms repeat proper substrings of sensitive patterns so that the frequency of non-sensitive patterns overlapping with sensitive ones does not change. For $X$, we give a deterministic construction which may be easily reversible (i.e., it may enable a data recipient to construct $W$ from $X$), because the occurrences of # reveal the exact location of sensitive patterns. For $Y$, we give a construction which breaks several ties arbitrarily, thus being less easily reversible. We further address the reversibility issue by defining the MCSR (Minimum-Cost Separators Replacement) problem and designing an algorithm for dealing with it. In MCSR, we seek to replace all separators, so that the location of sensitive patterns is not revealed, while preserving data utility. In addition, we define the problem of constructing the sanitized string in the MED setting, which is referred to as ETFS (Edit-distance, Total order, Frequency, Sanitization), and design an algorithm to solve it.
Our work makes the following specific contributions:
1. We design an algorithm, TFS-ALGO, for solving the TFS problem in $O(kn)$ time, where $n$ is the length of $W$. In fact, we prove that $O(kn)$ time is worst-case optimal by showing that the length of $X$ is in $\Theta(kn)$ in the worst case. The output of TFS-ALGO is a string $X$ consisting of a sequence of substrings over the alphabet of $W$ separated by # (see Example 1 below). An important feature of our algorithm, which is useful in the efficient construction of $Y$ discussed next, is that it can be implemented to produce an $O(n)$-sized representation of $X$ with respect to $W$ in $O(n)$ time. See Section 3.
Example 1
Let , , and the set of sensitive patterns be . The string consists of three substrings over the alphabet separated by . Note that no sensitive pattern occurs in , while all non-sensitive substrings of length have the same frequency in and in (e.g., aaba appears once), and they appear in the same order in and in (e.g., aaba precedes abaa). Also, note that any shorter string than would either create sensitive patterns or change the frequencies (*e.g., *removing the last letter of creates a string in which caab no longer appears). ∎
2. We define the PFS problem relaxing P1 of TFS to produce shorter strings that are more efficient to analyze. Instead of a total order (P1), we require a partial order ($\Pi$1) that preserves the order of appearance only for sequences of successive non-sensitive length-$k$ substrings that overlap by $k-1$ letters. This makes sense because the order of two successive non-sensitive length-$k$ substrings with no length-$(k-1)$ overlap has anyway been “interrupted” (by a sensitive pattern). We exploit this observation to shorten the string further. Specifically, we design an algorithm that solves PFS in the optimal $O(n + |Y|)$ time, where $|Y|$ is the length of $Y$, using the $O(n)$-sized representation of $X$. See Section 4.
Example 2
(Cont’d from Example 1) Recall that . A string is aaacbcbbba#aabaabbacaab. The order of aaba and abaa is preserved in since they are successive, non-sensitive, and with an overlap of letters. The order of abaa and aaac, which are successive non-sensitive, is not preserved since they do not have an overlap of letters. ∎
3. We define the MCSR problem, which seeks to produce a string $Z$, by deleting or replacing all separators # in $Y$ with letters from the alphabet of $W$ so that: no sensitive patterns are reinstated in $Z$; occurrences of spurious patterns that may not be mined from $W$ but can be mined from $Z$, at a given support threshold $\tau$, are prevented; and the distortion incurred by the replacements in $Z$ is bounded. The first requirement is to preserve privacy and the next two to preserve data utility. We show that MCSR is NP-hard and propose a heuristic to attack it. We also show how to apply the heuristic, so that letter replacements do not result in implausible (i.e., statistically unexpected) patterns that may reveal the location of sensitive patterns. See Section 5.
Example 3
(Cont’d from Example 2) Recall that . Let . A string is produced by replacing letter with letter c. Note that contains no sensitive pattern, nor a non-sensitive pattern of length- substring that could not be mined from at a support threshold (i.e., a pattern that does not occur in ). In addition, contains no implausible pattern, such as bbab, which is not expected to occur in , according to an established statistical significance measure for strings [8, 30, 4]. ∎
4. We design an algorithm for solving the ETFS problem. The algorithm, called ETFS-ALGO, is based on a connection between ETFS and the approximate regular expression matching problem [26]. Given a string and a regular expression, the latter problem seeks to find a string that matches the regular expression and is at minimal edit distance from the given string. ETFS-ALGO solves the ETFS problem in $O(k|\Sigma|n^2)$ time, where $|\Sigma|$ is the size of the alphabet of $W$. See Section 6.
Example 4
Let , , and the set of sensitive patterns be . TFS-ALGO constructs string , where is the empty string, with . On the contrary, ETFS-ALGO constructs string with . Clearly, string is more suitable for applications, which are based on measuring sequence similarity. ∎
5. For the MSL setting, we implemented our combinatorial approach for sanitizing a string $W$ (i.e., the aforementioned algorithms implementing the pipeline $W \rightarrow X \rightarrow Y \rightarrow Z$) and show its effectiveness and efficiency on real and synthetic data. We also show that it is possible to produce a string $Z$ that does not contain implausible patterns, while incurring insignificant additional utility loss. See Section 7.
6. For the MED setting, we implemented ETFS-ALGO and experimentally compared it with TFS-ALGO. Interestingly, we demonstrate that TFS-ALGO constructs optimal or near-optimal solutions to the ETFS problem in practice. This is particularly encouraging because TFS-ALGO is linear in the length $n$ of the input string, whereas ETFS-ALGO is quadratic in $n$. See Section 7.
A preliminary version of this paper, without the method that avoids implausible patterns and without contributions 4 and 6, appeared in [5]. Furthermore, we include here all proofs omitted from [5], as well as additional examples and discussion of related work.
2 Preliminaries, Problem Statements, and Main Results
Preliminaries
Let $T = T[0]T[1]\cdots T[n-1]$ be a string of length $|T| = n$ over a finite ordered alphabet $\Sigma$ of size $|\Sigma|$. By $\Sigma^*$ we denote the set of all strings over $\Sigma$. By $\Sigma^k$ we denote the set of all length-$k$ strings over $\Sigma$. For two positions $i$ and $j$ on $T$, we denote by $T[i..j] = T[i]\cdots T[j]$ the substring of $T$ that starts at position $i$ and ends at position $j$ of $T$. By $\varepsilon$ we denote the empty string of length 0. A prefix of $T$ is a substring of the form $T[0..j]$, and a suffix of $T$ is a substring of the form $T[i..n-1]$. A proper prefix (suffix) of a string is not equal to the string itself. By $\mathrm{Freq}_V(U)$ we denote the number of occurrences of string $U$ in string $V$. Given two strings $U$ and $V$ we say that $U$ has a suffix-prefix overlap of length $\ell > 0$ with $V$ if and only if the length-$\ell$ suffix of $U$ is equal to the length-$\ell$ prefix of $V$, i.e., $U[|U|-\ell..|U|-1] = V[0..\ell-1]$.
We fix a string $W$ of length $n$ over an alphabet $\Sigma$ and an integer $0 < k < n$. We refer to a length-$k$ string or a pattern interchangeably. An occurrence of a pattern is uniquely represented by its starting position. Let $\mathcal{S}$ be a set of positions over $\{0, \ldots, n-k\}$ with the following closure property: for every $i \in \mathcal{S}$, if there exists $j$ such that $W[j..j+k-1] = W[i..i+k-1]$, then $j \in \mathcal{S}$. That is, if an occurrence of a pattern is in $\mathcal{S}$, all its occurrences are in $\mathcal{S}$. A substring $W[i..i+k-1]$ of $W$ is called sensitive if and only if $i \in \mathcal{S}$. $\mathcal{S}$ is thus the set of occurrences of sensitive patterns. The difference set $\mathcal{I} = \{0, \ldots, n-k\} \setminus \mathcal{S}$ is the set of occurrences of non-sensitive patterns.
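The closure property of the set of sensitive positions can be illustrated with a short Python sketch. The helper names (`occurrences`, `close_positions`) and the example string are hypothetical and only serve the illustration; positions are 0-based here, which is an assumption of this sketch.

```python
def occurrences(w: str, k: int, patterns: set) -> set:
    """All starting positions of the given length-k patterns in w."""
    return {i for i in range(len(w) - k + 1) if w[i:i + k] in patterns}


def close_positions(w: str, k: int, positions: set) -> set:
    """Smallest superset of `positions` satisfying the closure property:
    if one occurrence of a pattern is included, all its occurrences are."""
    patterns = {w[i:i + k] for i in positions}
    return occurrences(w, k, patterns)


# Illustrative input (an assumption, chosen only for demonstration).
w, k = "aabaaaababbbaab", 4
sens = occurrences(w, k, {"aaaa", "baaa", "bbaa"})   # sensitive positions
non_sens = set(range(len(w) - k + 1)) - sens          # non-sensitive positions
print(sorted(sens), sorted(non_sens))
```

Building the sensitive set from a set of patterns, as in `occurrences`, yields a closed set automatically; `close_positions` is only needed when the input is a raw list of positions.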
For any string $U$, we denote by $\mathcal{I}_U$ the set of occurrences of non-sensitive length-$k$ strings over $\Sigma$ in $U$. (We have that $\mathcal{I}_W = \mathcal{I}$.) We call an occurrence $i$ the t-predecessor of another occurrence $j$ in $\mathcal{I}_U$ if and only if $i$ is the largest element in $\mathcal{I}_U$ that is less than $j$. This relation induces a strict total order on the occurrences in $\mathcal{I}_U$. We call $i$ the p-predecessor of $j$ in $\mathcal{I}_U$ if and only if $i$ is the t-predecessor of $j$ in $\mathcal{I}_U$ and $U[i..i+k-1]$ has a suffix-prefix overlap of length $k-1$ with $U[j..j+k-1]$. This relation induces a strict partial order on the occurrences in $\mathcal{I}_U$. We call a subset $\mathcal{J}$ of $\mathcal{I}_U$ a t-chain (resp., p-chain) if for all elements in $\mathcal{J}$ except the minimum one, their t-predecessor (resp., p-predecessor) is also in $\mathcal{J}$. For two strings $U$ and $V$, chains $\mathcal{J}_U$ and $\mathcal{J}_V$ are equivalent, denoted by $\mathcal{J}_U \equiv \mathcal{J}_V$, if and only if $|\mathcal{J}_U| = |\mathcal{J}_V|$ and $U[u..u+k-1] = V[v..v+k-1]$, where $u$ is the $j$th smallest element of $\mathcal{J}_U$ and $v$ is the $j$th smallest of $\mathcal{J}_V$, for all $j \leq |\mathcal{J}_U|$.
Given two strings $U$ and $V$ the edit distance $d_E(U, V)$ is defined as the minimum number of elementary edit operations (letter insertion, deletion, or substitution) to transform $U$ to $V$.
The set of regular expressions over an alphabet $\Sigma$ is defined recursively as follows [26]: (I) $a \in \Sigma \cup \{\varepsilon\}$, where $\varepsilon$ denotes the empty string, is a regular expression. (II) If $E$ and $F$ are regular expressions, then so are $EF$, $E|F$, and $E^*$, where $EF$ denotes the set of strings obtained by concatenating a string in $E$ and a string in $F$, $E|F$ is the union of the strings in $E$ and $F$, and $E^*$ consists of all strings obtained by concatenating zero or more strings from $E$. Parentheses are used to override the natural precedence of the operators, which places the operator $^*$ highest, the concatenation next, and the operator $|$ last. We state that a string $T$ matches a regular expression $E$, if $T$ is equal to one of the strings in $E$.
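The edit distance defined above can be computed with the textbook dynamic program. The following Python sketch is a standard implementation given for illustration only; the function name `edit_distance` is an assumption and the code is not the algorithm used by ETFS-ALGO.

```python
def edit_distance(u: str, v: str) -> int:
    """Minimum number of letter insertions, deletions, or substitutions
    needed to transform u into v (two-row dynamic programming)."""
    prev = list(range(len(v) + 1))          # distances from "" to prefixes of v
    for i, a in enumerate(u, 1):
        cur = [i]                           # distance from u[:i] to ""
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

The table has one cell per pair of prefixes, so the running time is proportional to the product of the two string lengths.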
Problem Statements and Main Results
We define the following problem for the MSL setting.
Problem 1 (TFS)
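Given $W$, $k$, $\mathcal{S}$, and $\mathcal{I}$, construct the shortest string $X$ such that:
C1: $X$ does not contain any sensitive pattern.
P1: $\mathcal{I}_W \equiv \mathcal{I}_X$, i.e., the t-chains $\mathcal{I}_W$ and $\mathcal{I}_X$ are equivalent.
P2: $\mathrm{Freq}_X(U) = \mathrm{Freq}_W(U)$, for all $U \in \Sigma^k \setminus \{W[i..i+k-1] : i \in \mathcal{S}\}$.
TFS requires constructing the shortest string $X$ in which all sensitive patterns from $W$ are concealed (C1), while preserving the order (P1) and the frequency (P2) of all non-sensitive patterns. Our first result is the following.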
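As a sanity check on the problem statement, the constraint C1 and the properties P1 and P2 can be verified by brute force for a candidate output. The Python sketch below is only an illustration: the function names, the separator `#`, and the candidate string are assumptions, and the check says nothing about whether the candidate has minimal length.

```python
from collections import Counter


def nonsensitive_sequence(s: str, k: int, sensitive: set, sep: str = "#") -> list:
    """Non-sensitive length-k substrings of s over the original alphabet
    (windows containing the separator are skipped), in order of appearance."""
    windows = (s[i:i + k] for i in range(len(s) - k + 1))
    return [u for u in windows if sep not in u and u not in sensitive]


def satisfies_c1_p1_p2(w: str, x: str, k: int, sensitive: set, sep: str = "#") -> bool:
    c1 = all(x[i:i + k] not in sensitive for i in range(len(x) - k + 1))
    seq_w = nonsensitive_sequence(w, k, sensitive, sep)
    seq_x = nonsensitive_sequence(x, k, sensitive, sep)
    p1 = seq_w == seq_x                      # same patterns in the same order
    p2 = Counter(seq_w) == Counter(seq_x)    # same frequencies
    return c1 and p1 and p2


# Illustrative input and candidate output (assumptions for demonstration).
w = "aabaaaababbbaab"
sensitive = {"aaaa", "baaa", "bbaa"}
print(satisfies_c1_p1_p2(w, "aabaa#aaababbba#baab", 4, sensitive))
```

Comparing the two sequences position by position is a direct reading of the t-chain equivalence in P1; equality of the sequences already forces equality of the counters used for P2.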
Theorem 2.1
Let $W$ be a string of length $n$ over $\Sigma$. Given $k < n$ and $\mathcal{S}$, TFS-ALGO solves Problem 1 in $O(kn)$ time, which is worst-case optimal. An $O(n)$-sized representation of $X$ can be built in $O(n)$ time.
P1 implies P2, but P1 is a strong assumption that may result in long output strings that are inefficient to analyze. We thus relax P1 to require that the order of appearance remains the same only for sequences of successive non-sensitive length-$k$ substrings that also overlap by $k-1$ letters (p-chains). This leads to the following problem for the MSL setting.
Problem 2 (PFS)
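Given $W$, $k$, $\mathcal{S}$, and $\mathcal{I}$, construct a shortest string $Y$ such that:
C1: $Y$ does not contain any sensitive pattern.
$\Pi$1: There exists an injective function $f$ from the p-chains of $\mathcal{I}_W$ to the p-chains of $\mathcal{I}_Y$ such that $f(\mathcal{J}_W) \equiv \mathcal{J}_W$ for any p-chain $\mathcal{J}_W$ of $\mathcal{I}_W$.
P2: $\mathrm{Freq}_Y(U) = \mathrm{Freq}_W(U)$, for all $U \in \Sigma^k \setminus \{W[i..i+k-1] : i \in \mathcal{S}\}$.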
Our second result, which builds on Theorem 2.1, is the following.
Theorem 2.2
Let $W$ be a string of length $n$ over $\Sigma$. Given $k < n$ and $\mathcal{S}$, PFS-ALGO solves Problem 2 in the optimal $O(n + |Y|)$ time.
To arrive at Theorems 2.1 and 2.2, we use a special letter (separator) # $\notin \Sigma$ when required. However, the occurrences of # may reveal the locations of sensitive patterns. We thus seek to delete or replace the occurrences of # in $Y$ with letters from $\Sigma$. The new string $Z$ should not reinstate sensitive patterns or create implausible patterns. Given an integer threshold $\tau > 0$, we call a pattern $U \in \Sigma^k$ a $\tau$-ghost in $Z$ if and only if $\mathrm{Freq}_W(U) < \tau$ but $\mathrm{Freq}_Z(U) \geq \tau$. Moreover, we seek to prevent $\tau$-ghost occurrences in $Z$ by also bounding the total weight of the letter choices we make to replace the occurrences of #. This is the MCSR problem. We show that already a restricted version of the MCSR problem, namely, the version when $k = 1$, is NP-hard via the Multiple Choice Knapsack (MCK) problem [28].
Theorem 2.3
The MCSR problem is NP-hard.
Based on this connection, we propose a non-trivial heuristic algorithm to attack the MCSR problem for the general case of an arbitrary $k$.
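The notion of a spurious pattern that cannot be mined from the original string but can be mined from the sanitized one at a given support threshold can be checked directly by counting. The Python sketch below assumes the reading "frequency below the threshold in the original string and at least the threshold in the sanitized string"; the function names and the toy strings are hypothetical.

```python
from collections import Counter


def pattern_counts(s: str, k: int) -> Counter:
    """Frequency of every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spurious_patterns(w: str, z: str, k: int, tau: int) -> set:
    """Length-k patterns with support < tau in w but support >= tau in z."""
    freq_w, freq_z = pattern_counts(w, k), pattern_counts(z, k)
    # Counter returns 0 for patterns that never occur in w.
    return {u for u, c in freq_z.items() if c >= tau and freq_w[u] < tau}


# Toy strings (assumptions for demonstration only).
print(spurious_patterns("abab", "ababab", 2, 3))
```

With threshold 1 the function returns exactly the patterns that occur in the sanitized string but never in the original one.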
We define the following problem for the MED setting.
Problem 3 (ETFS)
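Given $W$, $k$, $\mathcal{S}$, and $\mathcal{I}$, construct a string $X_{ED}$ which is at minimal edit distance from $W$ and satisfies the following:
C1: $X_{ED}$ does not contain any sensitive pattern.
P1: $\mathcal{I}_W \equiv \mathcal{I}_{X_{ED}}$, i.e., the t-chains $\mathcal{I}_W$ and $\mathcal{I}_{X_{ED}}$ are equivalent.
P2: $\mathrm{Freq}_{X_{ED}}(U) = \mathrm{Freq}_W(U)$, for all $U \in \Sigma^k \setminus \{W[i..i+k-1] : i \in \mathcal{S}\}$.
We show how to reduce any instance of the ETFS problem to some instance of the approximate regular expression matching problem. In particular, the latter instance consists of a string of length $n$ (string $W$) and a regular expression $E$ of length $O(k|\Sigma|n)$. We thus prove the claim of Theorem 2.4 by employing the $O(|W| \cdot |E|)$-time algorithm of [26].
Theorem 2.4
Let $W$ be a string of length $n$ over an alphabet $\Sigma$. Given $k < n$ and $\mathcal{S}$, ETFS-ALGO solves Problem 3 in $O(k|\Sigma|n^2)$ time.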
3 TFS-ALGO
We convert string $W$ into a string $X$ over alphabet $\Sigma \cup \{\#\}$, # $\notin \Sigma$, by reading the letters of $W$, from left to right, and appending them to $X$ while enforcing the following two rules:
R1: When the last letter of a sensitive substring $U$ is read from $W$, we append # to $X$ (essentially replacing this last letter of $U$ with #). Then, we append the succeeding non-sensitive substring (in the t-predecessor order) after #.
R2: When the $k-1$ letters before # are the same as the $k-1$ letters after #, we remove # and the succeeding $k-1$ letters (inspect Fig. 1).
R1 prevents $U$ from occurring in $X$, and R2 reduces the length of $X$ (i.e., allows to hide sensitive patterns with fewer extra letters). Both rules leave unchanged the order and frequencies of non-sensitive patterns. It is crucial to observe that applying the idea behind R2 on more than $k-1$ letters would decrease the frequency of some pattern, while applying it on fewer than $k-1$ letters would create new patterns. Thus, we need to consider just R2 as-is.
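The combined effect of R1 and R2 can be sketched in Python as follows. This is a simplified illustration written from the two rules as stated, not the paper's pseudocode: it scans the non-sensitive patterns in order and compares each with the previously appended one. The function name `tfs_sketch`, the use of a set of sensitive patterns as input, and the example strings are assumptions.

```python
def tfs_sketch(w: str, k: int, sensitive: set, sep: str = "#") -> str:
    """Build a sanitized string from w by applying R1 and R2 on the fly."""
    out = []
    prev = None                       # last non-sensitive pattern appended
    for i in range(len(w) - k + 1):
        pat = w[i:i + k]
        if pat in sensitive:
            continue                  # sensitive patterns are never completed
        if prev is None:
            out.extend(pat)           # first non-sensitive pattern
        elif prev[1:] == pat[:-1]:
            # Overlap of k-1 letters: either the two patterns are adjacent
            # in w, or R2 applies, so only the last letter is appended.
            out.append(pat[-1])
        else:
            out.append(sep)           # R1: separator, then the next pattern
            out.extend(pat)
        prev = pat
    return "".join(out)


# Illustrative inputs (assumptions for demonstration only).
print(tfs_sketch("aabaaaababbbaab", 4, {"aaaa", "baaa", "bbaa"}))
print(tfs_sketch("abcbd", 2, {"bc", "cb"}))
```

In the second call the patterns ab and bd are separated by sensitive patterns but overlap by one letter, so R2 removes the separator and the output contains no # at all.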
Let $C$ be an array of size $n$ that stores the occurrences of sensitive and non-sensitive patterns: $C[i] = 1$ if $i \in \mathcal{S}$ and $C[i] = 0$ if $i \in \mathcal{I}$. For technical reasons we set the last $k-1$ values in $C$ equal to $C[n-k]$; i.e., $C[n-k+1] := \ldots := C[n-1] := C[n-k]$. Note that $C$ is constructible from $\mathcal{S}$ in $O(n)$ time. Given $C$ and $k < n$, TFS-ALGO efficiently constructs $X$ by implementing R1 and R2 concurrently as opposed to implementing R1 and then R2 (see the proof of Lemma 1 for details of the workings of TFS-ALGO and Fig. 1 for an example). We next show that string $X$ enjoys several properties.
Lemma 1
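Let $W$ be a string of length $n$ over $\Sigma$. Given $k < n$ and array $C$, TFS-ALGO constructs the shortest string $X$ such that the following hold:
- (I)
There exists no $W[i..i+k-1]$ with $C[i] = 1$ occurring in $X$ (C1).
- (II)
$\mathcal{I}_W \equiv \mathcal{I}_X$, i.e., the order of substrings $W[i..i+k-1]$, for all $i$ such that $C[i] = 0$, is the same in $W$ and in $X$; conversely, the order of all substrings $U \in \Sigma^k$ of $X$ is the same in $X$ and in $W$ (P1).
- (III)
$\mathrm{Freq}_X(U) = \mathrm{Freq}_W(U)$, for all $U \in \Sigma^k \setminus \{W[i..i+k-1] : C[i] = 1\}$ (P2).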
- (IV)
The occurrences of letter in are at most and they are at least positions apart (P3).
- (V)
* and these bounds are tight (P4).*
Proof
C1: Index in TFS-ALGO runs over the positions of string ; at any moment it indicates the ending position of the currently considered length- substring of . When (Lines 1-1) TFS-ALGO never appends , *i.e., * the last letter of a sensitive length- substring, implying that, by construction of , no with occurs in .
P1: When (Lines 1-1) TFS-ALGO appends to , thus the order of and is clearly preserved. When and , index stores the starting position on of the -length suffix of the last non-sensitive substring appended to (see also Fig. 1). C1 ensures that no sensitive substring is added to in this case, nor when . The next letter will thus be appended to when and (Lines 1-1). The condition on Line 1 is satisfied if and only if the last non-sensitive length- substring appended to overlaps with the immediately succeeding non-sensitive one by letters: in this case, the last letter of the latter is appended to by Line 1, clearly maintaining the order of the two. Otherwise, Line 1 will append to , once again maintaining the length- substrings’ order. Conversely, by construction, any occurs in only if it equals a length- non-sensitive substring of . The only occasion when a letter from is appended to more then once is when Line 1 is executed: it is easy to see that in this case, because of the occurrence of , each of the repeated letters creates exactly one , without introducing any new length- string over nor increasing the occurrences of a previous one. Finally, Line 1 does not introduce any new except for the one present in , nor any extra occurrence of the latter, because it is only executed when two consecutive non-sensitive length- substrings of overlap exactly by letters.
P2: It follows from the proof for C1 and P1.
P3: Letter is added only by Line 1, which is executed only when and . This can be the case up to times as array can have alternate values only in the first positions. By construction, cannot start with (Lines 1-1), and thus the maximal number of occurrences of is . By construction, letter in is followed by at least letters (Line 1): the leftmost non-sensitive substring following a sequence of one or more occurrences of sensitive substrings in .
P4:Upper bound. TFS-ALGO increases the length of string by more than one letter only when letter is added to (Line 1). Every time Lines 1-1 are executed, the length of increases by letters. Thus the length of is maximized when the maximal number of occurrences of is attained. This length is thus bounded by .
Tightness. For the lower bound, let and be sensitive. The condition at Line 1 is not satisfied because no element in is set to 0: . Then the condition on Line 1 is also not satisfied because , and thus TFS-ALGO outputs the empty string. A de Bruijn sequence of order over an alphabet is a string in which every possible length- string over occurs exactly once as a substring. For the upper bound, let be the order- de Bruijn sequence over alphabet , be even, and . and so Line 1 will add the first letters of to . Then observe that , and so on; this sequence of values corresponds to satisfying Lines 1 and 1 alternately. Line 1 does not add any letter to . The if statement on Line 1 will always fail because of the de Bruijn sequence property. We thus have a sequence of the non-sensitive length- substrings of interleaved by occurrences of appended to . TFS-ALGO thus outputs a string of length (see Example 5).
We finally prove that has minimal length. Let be the prefix of string obtained by processing . Let . We will proceed by induction on , claiming that is the shortest string such that C1 and P1-P4 hold for . We call such a string optimal.
Base case: . By Lines 1-1 of TFS-ALGO, is equal to the first non-sensitive length- substring of , and it is clearly the shortest string such that C1 and P1-P4 hold for .
Inductive hypothesis and step: is optimal for . If , and this is clearly optimal. If , thus still optimal. Finally, if and we have two subcases: if then , and once again is evidently optimal. Otherwise, . Suppose by contradiction that there exists a shorter such that C1 and P1-P4 still hold: either drop or append less than letters after . If we appended less than letters after , since TFS-ALGO will not read ever again, P2-P3 would be violated, as an occurrence of would be missed. Without , the last letters of would violate either C1 or P1 and P2 (since we suppose ). Then is optimal. ∎
Example 5 (Illustration of P3)
Let . We construct the order- de Bruijn sequence of length over alphabet , and choose . TFS-ALGO constructs:
[TABLE]
The upper bound of on the length of is attained. ∎
Let us now show the main result of this section.
See Theorem 2.1.
Proof
For the first part inspect TFS-ALGO. Lines 1-1 can be realized in time. The while loop in Line 1 is executed no more than times, and every operation inside the loop takes time except for Line 1 and Line 1 which take time. Correctness and optimality follow directly from Lemma 1 (P4).
For the second part, we assume that is represented by and a sequence of pointers to interleaved (if necessary) by occurrences of . In Line 1, we can use an interval to represent the length- substring of added to . In all other lines (Lines 1, 1 and 1) we can use as one letter is added to per one letter of . By Lemma 1 we can have at most occurrences of letter . The check at Line 1 can be implemented in constant time after linear-time pre-processing of for longest common extension queries [12]. All other operations take in total linear time in . Thus there exists an -sized representation of and it is constructible in time. ∎
4 PFS-ALGO
Lemma 1 tells us that is the shortest string satisfying constraint C1 and properties P1-P4. If we were to drop P1 and employ the partial order (see Problem 2), the length of would not always be minimal: if a permutation of the strings contains pairs , with a suffix-prefix overlap of length , we may further apply R2, obtaining a shorter string.
To find such a permutation efficiently and construct a shorter string from , we propose PFS-ALGO. The crux of our algorithm is an efficient method to solve a variant of the classic NP-complete Shortest Common Superstring (SCS) problem [15]. Specifically, our algorithm: (I) Computes the string using Theorem 2.1. (II) Constructs a collection of strings, each of two letters (two ranks); the first (resp., second) letter is the lexicographic rank of the length- prefix (resp., suffix) of each string in the collection . (III) Computes a shortest string containing every element in as a distinct substring. (IV) Constructs by mapping back each element to its distinct substring in . If there are multiple possible shortest strings, one is selected arbitrarily.
Example 6 (Illustration of the workings of PFS-ALGO)
Let and
[TABLE]
The collection consists of the following substrings: , , and . The collection consists of the following two-letter strings: . To construct , we first find the length- prefix and the length- suffix of each , , which leads to a collection . Then, we sort the collection lexicographically to obtain , and finally we replace each , , with the lexicographic ranks of its length- prefix and length- suffix. For instance, is replaced by . After that, a shortest string containing all elements of as distinct substrings is computed as: . This shortest string is mapped back to the solution . Note that contains one occurrence of and has length , while contains occurrences of and has length . ∎
We now present the details of PFS-ALGO. We first introduce the Fixed-Overlap Shortest String with Multiplicities (FO-SSM) problem: Given a collection of strings and an integer , with , for all , FO-SSM seeks to find a shortest string containing each element of as a distinct substring using the following operations on any pair of strings :
- (I)
;
- (II)
-.
Any solution to FO-SSM with and implies a solution to the PFS problem, because for all ’s (see Lemma 1, P3).
The FO-SSM problem is a variant of the SCS problem. In the SCS problem, we are given a set of strings and we are asked to compute the shortest common superstring of the elements of this set. The SCS problem is known to be NP-complete, even for binary strings [15]. However, if all strings are of length two, the SCS problem admits a linear-time solution [15]. We exploit this crucial detail to show a linear-time solution to the FO-SSM problem in Lemma 3. To arrive at this result, we first adapt the SCS linear-time solution of [15] to our needs (see Lemma 2) and plug this solution into Lemma 3.
Lemma 2
Let be a collection of strings, each of length two, over an alphabet . We can compute a shortest string containing every element of as a distinct substring in time.
Proof
We sort the elements of lexicographically in time using radixsort. We also replace every letter in these strings with its lexicographic rank from in time using radixsort. In time we construct the de Bruijn multigraph of these strings [9]. Within the same time complexity, we find all nodes in with in-degree, denoted by , smaller than out-degree, denoted by . We perform the following two steps:
Step 1
While there exists a node in with , we start an arbitrary path (with possibly repeated nodes) from , traverse consecutive edges and delete them. Each time we delete an edge, we update the in- and out-degree of the affected nodes. We stop traversing edges when a node with is reached: whenever , we also delete from . Then, we add the traversed path to a set of paths. The path can contain the same node more than once. If is empty we halt. Proceeding this way, there are no two elements and in such that starts with and ends with ; thus this path decomposition is minimal. If is not empty at the end, by construction, it consists of only cycles.
Step 2
While is not empty, we perform the following. If there exists a cycle that intersects with any path in we splice into , update with the result of splicing, and delete from . This operation can be efficiently implemented by maintaining an array of size of linked lists over the paths in : stores a list of pointers to all occurrences of letter in the elements of . Thus in constant time per node of we check if any such path exists in and splice the two in this case. If no such path exists in , we add to any of the path-linearizations of the cycle, and delete the cycle from . After each change to , we update and delete every node with from .
The correctness of this algorithm follows from the fact that is a minimal path decomposition of . Thus any concatenation of paths in represents a shortest string containing all elements in as distinct substrings. ∎
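The construction in Lemma 2 can be illustrated with a short sketch. The code below is not the two-step path and cycle splicing procedure of the proof: it swaps in the standard Hierholzer Euler-circuit routine on a multigraph that has been balanced with dummy edges, and then cuts the circuits at the dummy edges, which also yields a minimal path decomposition. It uses hashing and Python lists, so it does not claim the linear-time bound of the lemma. The function name `shortest_string_of_pairs` and the representation of two-letter strings as tuples are assumptions made for illustration.

```python
from collections import defaultdict


def shortest_string_of_pairs(pairs):
    """Return a shortest sequence of letters that contains every two-letter
    string in `pairs` (given as tuples) as a distinct substring."""
    out_edges = defaultdict(list)  # node -> [(target, is_dummy), ...]
    balance = defaultdict(int)     # out-degree minus in-degree
    for a, b in pairs:
        out_edges[a].append((b, False))
        balance[a] += 1
        balance[b] -= 1

    # Balance the multigraph with dummy edges from deficit to surplus nodes.
    surplus = [v for v, d in balance.items() for _ in range(d)]
    deficit = [v for v, d in balance.items() for _ in range(-d)]
    for u, v in zip(deficit, surplus):
        out_edges[u].append((v, True))

    def euler_circuit(start):
        # Iterative Hierholzer; each stack entry stores a node and whether
        # the edge used to reach it was a dummy edge.
        stack, circuit = [(start, False)], []
        while stack:
            node = stack[-1][0]
            if out_edges[node]:
                stack.append(out_edges[node].pop())
            else:
                circuit.append(stack.pop())
        return circuit[::-1]

    result = []
    for start in list(out_edges):
        if not out_edges[start]:
            continue  # every edge leaving this node has been used already
        circuit = euler_circuit(start)
        edges = [(circuit[i - 1][0], circuit[i][0], circuit[i][1])
                 for i in range(1, len(circuit))]
        first_dummy = next((i for i, e in enumerate(edges) if e[2]), None)
        if first_dummy is None:
            # Balanced component: linearise the cycle as a single path.
            result.append(edges[0][0])
            result.extend(v for _, v, _ in edges)
            continue
        # Rotate so the circuit ends with a dummy edge, then cut at dummies.
        edges = edges[first_dummy + 1:] + edges[:first_dummy + 1]
        result.append(edges[0][0])
        for i, (_, v, is_dummy) in enumerate(edges):
            # A real edge extends the current path; a dummy edge that is not
            # the last one starts the next path at its target node.
            if not is_dummy or i + 1 < len(edges):
                result.append(v)
    return result
```

For instance, the three pairs (0, 1), (1, 2) and (5, 6) need two paths, so the returned sequence has five letters, whereas the cycle (1, 2), (2, 3), (3, 1) is covered by a single path of four letters.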
Lemma 3
Let be a collection of strings over an alphabet . Given an integer , the FO-SSM problem for can be solved in time.
Proof
Consider the following renaming technique. Each length- substring of the collection is assigned a lexicographic rank from the range . Each string in is converted to a two-letter string as follows. The first letter is the lexicographic rank of its length- prefix and the second letter is the lexicographic rank of its length- suffix. We thus obtain a new collection of two-letter strings. Computing the ranks for all length- substrings in can be implemented in time by employing radixsort to sort and then the well-known LCP data structure over the concatenation of strings in [12]. The FO-SSM problem is thus solved by finding a shortest string containing every element of as a distinct substring. Since consists of two-letter strings only we can solve the problem in time by applying Lemma 2. The statement follows. ∎
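The renaming technique in the proof of Lemma 3 can be sketched as follows. This is a minimal illustration, assuming every string has length at least `d` and `d` is at least 1; it ranks only the length-`d` prefixes and suffixes, and it uses a comparison sort with hashing instead of the radixsort and LCP data structure of the proof, so it does not achieve the stated time bound. The function name `rename_to_two_letters` and the sample strings are hypothetical.

```python
def rename_to_two_letters(strings, d):
    """Replace each string by the pair (rank of its length-d prefix,
    rank of its length-d suffix); ranks follow lexicographic order."""
    pieces = sorted({s[:d] for s in strings} | {s[-d:] for s in strings})
    rank = {piece: i for i, piece in enumerate(pieces)}
    renamed = [(rank[s[:d]], rank[s[-d:]]) for s in strings]
    return renamed, pieces


# Hypothetical collection: the prefixes and suffixes of length 2 are
# ab, bc, ca, cb, which receive ranks 0, 1, 2, 3 in that order.
renamed, pieces = rename_to_two_letters(["abca", "cab", "bcb"], 2)
print(renamed)  # [(0, 2), (2, 0), (1, 3)]
```

Two strings can then be overlapped by `d` letters exactly when the second letter of one renamed pair equals the first letter of the other, which is what reduces FO-SSM to the two-letter case of Lemma 2.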
Thus, PFS-ALGO applies Lemma 3 on with (recall that ). Note that each time the concat operation is performed, it also places the letter in between the two strings.
Lemma 4
Let be a string of length over an alphabet . Given and array , PFS-ALGO constructs a shortest string with C1, , and P2-P4.
Proof
C1 and P2 hold trivially for as no length- substring over is added or removed from . Let . The order of non-sensitive length- substrings within , for all , is preserved in . Thus there exists an injective function from the p-chains of to the p-chains of such that for any p-chain of ( is preserved). P3 also holds trivially for as no occurrence of is added. Since , for P4, it suffices to note that the construction of in the proof of tightness in Lemma 1 (see also Example 5) ensures that there is no suffix-prefix overlap of length between any pair of length- substrings of over due to the property of the order- de Bruijn sequence. Thus the upper bound of on the length of is also tight for .
The minimality on the length of follows from the minimality of and the correctness of Lemma 3 that computes a shortest such string. ∎
Let us now show the main result of this section.
See Theorem 2.2.
Proof
We compute the -sized representation of string with respect to described in the proof of Theorem 2.1. This can be done in time. If , then we construct and return in time from the representation. If , implying , we compute the LCP data structure of string in time [12]; and implement Lemma 3 in time without reading string explicitly: we rather rename to a collection of two-letter strings by employing the LCP information of directly. We then construct and report in time . Correctness follows directly from Lemma 4. ∎
5 MCSR Problem, MCSR-ALGO, and Implausible Pattern Elimination
In the following, we introduce the MCSR problem and prove that it is NP-hard (see Section 5.1). Then, we introduce MCSR-ALGO, a heuristic to address this problem (see Section 5.2). Finally, we discuss how to configure MCSR-ALGO in order to eliminate implausible patterns (see Section 5.3).
5.1 The MCSR Problem
The strings and , constructed by TFS-ALGO and PFS-ALGO, respectively, may contain the separator , which reveals information about the location of the sensitive patterns in . Specifically, a malicious data recipient can go to the position of a in and “undo” Rule R1 that has been applied by TFS-ALGO, removing and the letters after from . The result could be an occurrence of the sensitive pattern. For example, applying this process to the first in shown in Fig. 1, results in recovering the sensitive pattern abab. A similar attack is possible on the string produced by PFS-ALGO, although it is hampered by the fact that substrings within two consecutive s in often swap places in .
To address this issue, we seek to construct a new string , in which s are either deleted or replaced by letters from . To preserve data utility, we favor separator replacements that have a small cost in terms of occurrences of -ghosts (patterns with frequency less than in and at least in ) and incur a level of distortion bounded by a parameter in . The cost of an occurrence of a -ghost at a certain position is given by function Ghost, while function Sub assigns a distortion weight to each letter that could replace a . Both functions will be described in further detail below.
To preserve privacy, we require separator replacements not to reinstate sensitive patterns. This is the MCSR problem, a restricted version of which is presented in Problem 4. The restricted version is referred to as and differs from MCSR in that it uses for the pattern length instead of an arbitrary value . is presented next for simplicity and because it is used in the proof of Lemma 5. Lemma 5 implies Theorem 2.3.
Problem 4 ()
Given a string over an alphabet with occurrences of letter , and parameters and , construct a new string by substituting the occurrences of in with letters from , such that:
(I) is minimum, and (II) .
Lemma 5
The problem is NP-hard.
Proof
We reduce the NP-hard Multiple Choice Knapsack (MCK) problem [32] to in polynomial time. In MCK, we are given a set of elements subdivided into , mutually exclusive classes, , and a knapsack. Each class has elements. Each element has an arbitrary cost and an arbitrary weight . The goal is to minimize the total cost (Eq. 1) by filling the knapsack with one element from each class (constraint II), such that the weights of the elements in the knapsack satisfy constraint I, where constant represents the minimum allowable total weight of the elements in the knapsack:
[TABLE]
subject to the constraints: (I) , (II), and (III) .
The variable takes value 1 if the element is chosen from class , and 0 otherwise (constraint III). We reduce any instance to an instance in polynomial time, as follows:
- (I)
Alphabet consists of letters , for each and each class , .
- (II)
We set . Every element of occurs exactly once: . Letter occurs times in . For convenience, let us denote by the th occurrence of in .
- (III)
We set and .
- (IV)
and . The functions are otherwise not defined.
This is clearly a polynomial-time reduction. We now prove the correspondence between a solution to the given instance and a solution to the instance .
We first show that if is a solution to , then is a solution to . Since the elements in have minimum , , and , the letters corresponding to the selected elements lead to a that incurs a minimum
[TABLE]
In addition, each letter that is considered by the inner sum of Eq. 2 corresponds to a single occurrence of , and these are all the occurrences of . Thus we obtain that
[TABLE]
(i.e., condition I in Problem 4 is satisfied). Since the elements in have total weight , the letters , they map to, lead to a with , which implies
[TABLE]
(i.e., condition II in Problem 4 is satisfied). is thus a solution to .
We finally show that, if is a solution to , then is a solution to . Since each , , is replaced by a single letter in , exactly one element will be selected from each class (i.e., conditions II-III of MCK are satisfied). Since the letters in satisfy condition I of Problem 4, every element of occurs exactly once in , and , their corresponding selected elements will have a minimum total cost. Since satisfies , the selected elements that correspond to will satisfy , which implies (i.e., condition I of MCK is satisfied). Therefore, is a solution to . The statement follows. ∎
Lemma 5 implies the main result of this section.
See Theorem 2.3.
The cost of -ghosts is captured by a function Ghost. This function assigns a cost to an occurrence of a , which is caused by a separator replacement at position , and is specified based on domain knowledge. For example, with a cost equal to for each gained occurrence of each , we penalize more heavily a -ghost with frequency much below in and the penalty increases with the number of gained occurrences. Moreover, we may want to penalize positions towards the end of a temporally ordered string, to avoid spurious patterns that would be deemed important in applications based on time-decaying models [11].
The replacement distortion is captured by a function Sub which assigns a weight to a letter that could replace a and is specified based on domain knowledge. The maximum allowable replacement distortion is . Small weights favor the replacement of separators with desirable letters (e.g., letters that reinstate non-sensitive frequent patterns), whereas letters that reinstate sensitive patterns are assigned a weight larger than that prohibits them from replacing a . As will be explained in Section 5.3, weights larger than are also assigned to letters which would lead to implausible substrings [18] if they replaced s.
5.2 MCSR-ALGO
We next present MCSR-ALGO, a non-trivial heuristic that exploits the connection of the MCSR and MCK [28] problems. We start with a high-level description of MCSR-ALGO:
- (I)
Construct the set of all candidate -ghost patterns (i.e., length- strings over with frequency below in that can have frequency at least in ).
- (II)
Create an instance of MCK from an instance of MCSR. For this, we map the th occurrence of to a class in MCK and each possible replacement of the occurrence with a letter to a different item in . Specifically, we consider all possible replacements with letters in and also a replacement with the empty string, which models deleting (instead of replacing) the th occurrence of . In addition, we set the costs and weights that are input to MCK as follows. The cost for replacing the th occurrence of with the letter is set to the sum of the Ghost function for all candidate -ghost patterns when the th occurrence of is replaced by . That is, we make the worst-case assumption that the replacement forces all candidate -ghosts to become -ghosts in . The weight for replacing the th occurrence of with letter is set to .
- (III)
Solve the instance of MCK and translate the solution back to a (possibly suboptimal) solution of the MCSR problem. For this, we replace the th occurrence of with the letter corresponding to the element chosen by the MCK algorithm from class , and similarly for each other occurrence of . If the instance has no solution (i.e., no possible replacement can hide the sensitive patterns), MCSR-ALGO reports that cannot be constructed and terminates.
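The third step can be illustrated with a small solver for the resulting multiple-choice instance. The sketch below is a plain pseudo-polynomial dynamic program, not the MCK algorithm of [28]; it assumes non-negative weights, one list of options per separator occurrence, and a budget `theta` on the total weight. The name `choose_replacements` and the input layout are hypothetical.

```python
def choose_replacements(costs, weights, theta):
    """Pick one option per separator occurrence, minimising total cost
    subject to total weight <= theta.

    costs[i][j] and weights[i][j] describe option j for the i-th separator.
    Returns (total_cost, chosen_option_indices), or None if infeasible."""
    # best maps a reachable total weight to the cheapest (cost, picks) so far.
    best = {0: (0, [])}
    for cost_row, weight_row in zip(costs, weights):
        nxt = {}
        for used, (cost, picks) in best.items():
            for j, (c, w) in enumerate(zip(cost_row, weight_row)):
                total = used + w
                if total > theta:
                    continue  # this option would exceed the weight budget
                if total not in nxt or cost + c < nxt[total][0]:
                    nxt[total] = (cost + c, picks + [j])
        if not nxt:
            return None  # no admissible option for this separator
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])
```

An option whose weight is set to infinity (`float("inf")`), as is done for letters that would reinstate a sensitive pattern, is never selected because it exceeds any finite budget.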
Lemma 6 below states the running time of an efficient implementation of MCSR-ALGO.
Lemma 6
MCSR-ALGO runs in time, where is the running time of the MCK algorithm for classes with elements each.
Proof
It should be clear that if we conceptually extend with the empty string, our approach takes into account the possibility of deleting (instead of replacing) an occurrence of . To ease comprehension though we only describe the case of letter replacements.
Step 1
Given , , , , and , we construct a set of candidate -ghosts as follows. The candidates are at most distinct strings of length . The first term corresponds to all substrings of length over occurring in (i.e., if did not contain , we would have such substrings; each of the causes the loss of such substrings). The second term corresponds to all possible substrings of length that may be introduced in but do not occur in . For any string from the set of these strings, we want to compute and its maximal frequency in , denoted by , i.e., the largest possible frequency that can have in , to construct set . Let denote the string of length , containing the consecutive length- substrings, obtained after replacing the th occurrence of with letter in .
- (I)
If , by definition can never become -ghost in , and we thus exclude it from . , for all occurring in , can be computed in total time using the suffix tree of [12].
- (II)
If , by definition can never become -ghost in , and we thus exclude it from . can be computed by adding to , the maximum additional number of occurrences of caused by a letter replacement among all possible letter replacements. We sum up this quantity for each and for all replacements of occurrences of to obtain . To do this, we first build the generalized suffix tree of in time [12]. We then spell , for all , in the generalized suffix tree in time per . We exploit suffix links to spell the length- substrings of in time and memorize the maximum number of occurrences of caused by replacing the th occurrence of among all . We represent set on the generalized suffix tree by marking the corresponding nodes, and we denote this representation by . The total size of this representation is .
Step 2
We now want to construct an instance of the MCK problem using . We first set letter as element of class . We then set equal to the sum of the Ghost function cost incurred by replacing the th occurrence of by letter for all (at most ) affected length- substrings that are marked in . The main assumption of our heuristic is that this letter replacement will force all of these affected length- substrings to become -ghosts in . The computation of is done as follows. For each , and , we have substrings whose frequency changes, each of length . Let be one such pattern occurring at position of , where and is the th occurrence of in . We check if is marked in or not. If is not marked we add nothing to . If is marked, we increment by . We also set (as stated above, any letter that reinstates a sensitive pattern is assigned a weight , so that it cannot be selected to replace an occurrence of in Step ). Similar to Step 1, the total time required for this computation is .
Step 3
In Step 2, we have computed and , for all , and . We thus have an instance of the MCK problem. We solve it and translate the solution back to a (suboptimal) solution of the MCSR problem: the element chosen by the MCK algorithm from class corresponds to letter and it is used to replace the th occurrence of , for all . The cost of solving MCK depends on the chosen algorithm and is given by a function .
Thus, the total cost of MCSR-ALGO is . ∎
5.3 Eliminating Implausible Patterns
We present the notion of implausible substring and explain how we can ensure that implausible patterns do not occur in , as a result of applying the MCSR-ALGO algorithm to string .
Consider, for instance, an input string that models the movement of an individual, and the string abc, which is created as a substring of when we replace with b. Consider further that an individual can, generally, not go from a to c through b, or that it is highly unlikely for them to do so. We call a substring such as abc implausible. Clearly, if abc occurs in , it may be possible for an attacker to infer that b replaced , and then infer a sensitive pattern by “undoing” R1 as explained in Section 5.1. In order to effectively model this scenario, we define implausible patterns based on a statistical significance measure for strings [8, 30, 4]. The measure is defined as follows [8]:
[TABLE]
where is a string with , is the reference string, and
[TABLE]
is the expected frequency of in , computed based on an independence assumption between the event “ occurs in ” and “ occurs in ”. The measure is a normalized version of the standard score of , based on the fact that the variance [30]. A small indicates that occurs less often than expected, and hence it can naturally be considered as an artefact of sanitization.
Given a user-defined threshold , we define a string as -implausible if . The set of -implausible substrings of can be computed in the optimal time [4]. We use as the reference string, assuming that it is a good representation of the domain; e.g., a trip (substring) that is -implausible in is also implausible in general. Alternatively, one could use any other string as reference, impose length constraints on implausible patterns [22, 33], or even directly specify substrings that should not occur in based on domain knowledge.
Given the set of (-)implausible patterns, we ensure that no replacement creates in , where is the letter that replaces , by assigning a weight , for each such that and . This guarantees that no replacement leading to an artefact occurrence of an element of is performed by MCSR-ALGO. Note, however, that a -implausible pattern may occur in as a substring, either because it occurred in a part of that was copied to (e.g., a non-sensitive pattern), or due to the change of frequency of some substrings that are created in after the replacement of a . However, since such -implausible patterns did not contain a in the first place, they cannot be exploited by an attacker seeking to reverse the construction of .
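To make the significance measure of this section concrete, the following sketch scores a single string against a reference string. The exact formula is an assumption here: the code uses the form common in the avoided-words literature, namely the observed frequency minus the expected frequency, divided by the larger of 1 and the square root of the expected frequency, where the expected frequency is the product of the frequencies of the longest proper prefix and suffix divided by the frequency of their common infix. The function names are hypothetical and the counting is done naively rather than with the optimal-time method of [4].

```python
from math import sqrt


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def significance_score(reference, u):
    """Assumed form: (observed - expected) / max(sqrt(expected), 1)."""
    if len(u) < 3:
        raise ValueError("the score needs a non-empty infix, so len(u) >= 3")
    prefix_freq = count_occurrences(reference, u[:-1])
    suffix_freq = count_occurrences(reference, u[1:])
    infix_freq = count_occurrences(reference, u[1:-1])
    # Independence assumption between the prefix and suffix occurrences.
    expected = prefix_freq * suffix_freq / infix_freq if infix_freq else 0.0
    observed = count_occurrences(reference, u)
    return (observed - expected) / max(sqrt(expected), 1.0)


# "bab" occurs slightly less often in the reference than expected.
print(significance_score("abababac", "bab"))
```

Under this assumed form, a candidate replacement letter would be given a prohibitive weight whenever one of the substrings it creates scores below the user-defined threshold.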
6 ETFS-ALGO
Let and be two non-sensitive length- substrings of such that is the -predecessor of . Since and must occur in the same order in the solution string , the main choice we have to make in order to solve the ETFS problem is whether to:
- (I)
“merge” and when the length- suffix of and the length- prefix of match; or
- (II)
“interleave” and with a carefully selected string over .
Among operations I and II, for every such pair and , we must select the operation that globally results in the smallest number of edit operations. Operations I and II can naturally be expressed by means of a regular expression . In particular, this implies that any instance of the ETFS problem can be reduced to an instance of approximate regular expression matching and thus an algorithm for approximate regular expression matching between and [26] can be employed. More formally, given a string and a regular expression , the approximate regular expression matching problem is to find a string that matches with minimal . The following result is known.
Theorem 6.1 ([26])
Given a string and a regular expression , the approximate regular expression matching problem can be solved in time.
In the following, we define a specific type of a regular expression . Let us first define the following regular expression:
[TABLE]
where is the alphabet of and . We also define the following regular-expression gadgets, for a letter :
[TABLE]
Intuitively, the gadget represents a string we may choose to include in the output in an effort to minimize the edit distance between and the solution string . It should be clear that the length of is in and that cannot generate any length- substring over . Furthermore, inserting in cannot create any sensitive or non-sensitive pattern due to the occurrences of on both ends of . The gadgets and are similar to . They are added in the beginning and at the end of , respectively. This is because should not start or end with as this would only increase the edit distance to . As will be explained later, to construct , we also make use of the operator. Intuitively, the operator represents the choice we make between operation “merge” or “interleave”.
We are now in a position to describe ETFS-ALGO, an algorithm for solving the ETFS problem. ETFS-ALGO starts by constructing . Let be the sequence of non-sensitive length- substrings as they occur in from left to right. We first set and then process the pairs of non-sensitive length- substrings and , for all . At the th step, we examine whether or not and can be merged. If they can, we append to a regular expression , where is obtained by chopping off the length- prefix of (that is, the remainder of after merging it with ). Otherwise, we append to . Intuitively, using corresponds to choosing “merge” and to choosing “interleave”. After examining each pair and , we append to . This concludes the construction of . Note how, for any combination of choices, will always appear in the string obtained.
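The pairwise examination described above can be sketched without building the regular expression itself. The code below only lists the non-sensitive patterns and records, for each consecutive pair, whether merging is available and what would remain of the second pattern after a merge. It assumes that patterns have length `k`, that merging requires an overlap of `k - 1` letters, and that sensitive patterns are supplied as a set of strings rather than positions; the function name `merge_plan` is hypothetical, and the gadgets and the approximate matching step are not modelled.

```python
def merge_plan(w, k, sensitive):
    """List the non-sensitive length-k substrings of w from left to right
    and, for each consecutive pair, whether 'merge' is an available choice."""
    patterns = [w[i:i + k] for i in range(len(w) - k + 1)
                if w[i:i + k] not in sensitive]
    plan = []
    for x, y in zip(patterns, patterns[1:]):
        if x[1:] == y[:k - 1]:
            # Mergeable: the expression offers a choice between writing the
            # remainder of y (merge) and separating the two (interleave).
            plan.append((x, y, "merge or interleave", y[k - 1:]))
        else:
            # Not mergeable: the two patterns must be interleaved.
            plan.append((x, y, "interleave", y))
    return patterns, plan


# Hypothetical instance with k = 2 and one sensitive pattern.
patterns, plan = merge_plan("abcab", 2, {"ca"})
print(patterns)  # ['ab', 'bc', 'ab']
print(plan)
```

In this instance the first pair overlaps in the letter b, so only the letter c would be written after a merge, while the second pair admits no merge.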
Next, ETFS-ALGO employs Theorem 6.1 to construct . In particular, it finds a string that matches with minimal . Last, it sets . We arrive at the main result of this section.
See Theorem 2.4.
Proof
Constructing can be done in time, since: (I) The non-sensitive length- substrings of can be obtained in time, by reading from left to right and checking . (II) Checking whether and are mergeable takes time via letter comparisons, and it is performed in each of the steps. (III) The length is . This is because contains at most occurrences of non-sensitive length- substrings, at most occurrences of , and one occurrence of each of and and because the lengths of , and are .
Computing from and can be performed in time using Theorem 6.1. Thus ETFS-ALGO takes time in total.
The correctness of ETFS-ALGO follows from the fact that by construction: (I) does not contain any sensitive pattern, so C1 is satisfied; (II) satisfies P1 and P2 as no length- substring over (other than the non-sensitive ones) is inserted in ; (III) All strings satisfying C1, P1 and P2 can be obtained by , since they must have the same t-chain of non-sensitive patterns over as , interleaved by length- substrings that are on but not on ; and (IV) the minimality on edit distance is guaranteed by Theorem 6.1. The statement follows. ∎
Example 7 (Illustration of the workings of ETFS-ALGO)
Let , , and the set of sensitive patterns be . The sequence of non-sensitive patterns is thus . Given that and , ETFS-ALGO constructs the following gadgets,
[TABLE]
[TABLE]
[TABLE]
and sets . Then, it iterates over each pair of successive non-sensitive length- substrings in the order they appear in (i.e., pair is considered in Step ) and the regular expression is updated, as detailed below.
In Step , ETFS-ALGO considers the pair . Observe that in this case and can be merged, since the length- suffix of and the length- prefix of match. Thus, is appended to . Recall that when merging, we chop off the length- prefix of (because we have merged it already) and write down what is left of (a in this case) before . Thus, .
In Step 2, ETFS-ALGO considers . Again, and can be merged. Thus, is appended to , which leads to .
In Steps 3 and 4, ETFS-ALGO considers the pairs and , respectively. Since the patterns in each pair can be merged, the algorithm appends into the regular expression and , for the first and second pair, respectively. This leads to .
In Step 5, ETFS-ALGO considers the last pair , which cannot be merged, and appends to . Since there is no other pair to be considered, is also appended to , leading to:
[TABLE]
At this point, ETFS-ALGO employs Theorem 6.1 to find the following string that matches (the choices that were made in the construction of are underlined in and , , are matched by the empty string):
[TABLE]
with minimal . Last, ETFS-ALGO returns . ∎
Note that in Example 7 does not contain any sensitive pattern and that all non-sensitive patterns of appear in in the same order and with the same frequency as they appear in . Note also that, for the same instance, TFS-ALGO would return string aaabaccb#cbbb with and .
7 Experimental Evaluation
We evaluate our algorithms in terms of effectiveness and efficiency. Effectiveness is measured based on data utility and number of implausible patterns. Efficiency is measured based on runtime.
Evaluated Algorithms
First, we consider the pipeline TFS-ALGO → PFS-ALGO → MCSR-ALGO, referred to as TPM. Given a string over , TPM sanitizes by applying TFS-ALGO, PFS-ALGO, and then MCSR-ALGO. MCSR-ALGO uses the -time algorithm of [28] for solving the MCK instances. The final output is a string over . MCSR-ALGO is configured with an empty set (i.e., it may lead to implausible patterns that are created in after the replacement of a ).
We did not compare TPM against existing methods, because they are not alternatives to TPM (see Section 8 for more details on related work). Instead, we compared TPM against a greedy baseline referred to as BA, in terms of data utility and efficiency. BA initializes its output string to and then considers each sensitive pattern in , from left to right. For each , BA replaces the letter of that has the largest frequency in with another letter that is not contained in and has the smallest frequency in , breaking all ties arbitrarily. Note that this letter replacement should not introduce any other sensitive pattern in . If no such exists, is replaced by to ensure that a solution is produced (even if it may reveal the location of a sensitive pattern). Each replacement removes the occurrence of and aims to prevent -ghost occurrences by selecting an that will not substantially increase the frequency of patterns overlapping with . Note that BA does not preserve the frequency of non-sensitive patterns, and thus, unlike TPM, it can incur -lost patterns. We also implemented a similar baseline that replaces the letter in that has the smallest frequency in with another letter that is not contained in and has the largest frequency in , but omit its results as it was worse than BA.
In addition, we consider the pipelines TFS-ALGO → MCSR-ALGO and TFS-ALGO → MCSRI-ALGO, referred to as TM and TMI, respectively. With MCSRI-ALGO we refer to the configuration of MCSR in which there is a non-empty set of -implausible patterns that must not occur in the output string . We omit PFS-ALGO from the TM and TMI pipelines to avoid the elimination of some implausible patterns due to re-ordering of blocks of non-sensitive patterns that is performed by PFS-ALGO.
Last, we consider ETFS-ALGO, which we compare to TFS-ALGO, to demonstrate that the latter is a very effective heuristic for the ETFS problem.
Experimental Data
We considered the following publicly available datasets used in [1, 16, 18, 21]: Oldenburg (OLD), Trucks (TRU), MSNBC (MSN), the complete genome of Escherichia coli (DNA), and synthetic data (uniformly random strings, the largest of which is referred to as SYN). See Table 1 for the characteristics of these datasets and the parameter values used in experiments, unless stated otherwise.
Experimental Setup
The sensitive patterns were selected randomly among the frequent length- substrings at minimum support following [16, 18, 21]. We used the fairly low values , , , and for TRU, OLD, MSN, and DNA, respectively, to have a wider selection of sensitive patterns. In MCSR-ALGO, we used a uniform cost of for every occurrence of each -ghost, a weight of (resp., ) for each letter replacement that does not (resp., does) create a sensitive pattern, and we further set . This setup treats all candidate -ghost patterns and all candidate letters for replacement uniformly, to facilitate a fair comparison with BA which cannot distinguish between -ghost candidates or favor specific letters. In MCSRI-ALGO, we instead set a weight for each letter replacement that does not create a sensitive pattern or an implausible pattern of length .
To capture the utility of sanitized data, we used the (frequency) distortion measure
[TABLE]
where is a non-sensitive pattern. The distortion measure quantifies changes in the frequency of non-sensitive patterns with low values suggesting that remains useful for tasks based on pattern frequency (e.g., identifying motifs corresponding to functional or conserved DNA [29]).
We also measured the number of -ghost and -lost patterns in following [16, 18, 21], where a pattern is in if and only if but . That is, -lost patterns model knowledge that can no longer be mined from but could be mined from , whereas -ghost patterns model knowledge that can be mined from but not from . A small number of -lost/ghost patterns suggests that frequent pattern mining can be accurately performed on [16, 18, 21]. Unlike BA, by design TPM does not incur any -lost pattern, as TFS-ALGO and PFS-ALGO preserve frequencies of non-sensitive patterns, and MCSR-ALGO can only increase pattern frequencies.
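The lost and ghost counts can be computed directly from the pattern frequencies in the two strings. The sketch below is a minimal illustration, assuming patterns are all length-`k` substrings that do not contain the separator `#`, and that a pattern counts as frequent when its frequency reaches the threshold `tau`; the function names are hypothetical and the counting is not optimized.

```python
from collections import Counter


def pattern_counts(s, k, separator="#"):
    """Frequencies of the length-k substrings of s that avoid the separator."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1)
                   if separator not in s[i:i + k])


def lost_and_ghost(w, z, k, tau):
    """Return (lost, ghost): patterns frequent in w but not in z, and
    patterns frequent in z but not in w, at threshold tau."""
    freq_w, freq_z = pattern_counts(w, k), pattern_counts(z, k)
    lost = {u for u, c in freq_w.items() if c >= tau and freq_z[u] < tau}
    ghost = {u for u, c in freq_z.items() if c >= tau and freq_w[u] < tau}
    return lost, ghost


# Hypothetical strings: "ba" drops below the threshold, while "bc" and "cb"
# rise above it.
lost, ghost = lost_and_ghost("ababab", "ababcbcbc", 2, 2)
print(sorted(lost), sorted(ghost))  # ['ba'] ['bc', 'cb']
```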
To examine the benefit of using MCSRI-ALGO instead of MCSR-ALGO when implausible patterns need to be eliminated, we measured the percentage of -implausible patterns of length that may occur in , when a letter replaces a . Clearly, the percentage is 0 when MCSRI-ALGO is used, and a large percentage for MCSR-ALGO implies that it is beneficial to use MCSRI-ALGO instead.
To capture the effectiveness of TFS-ALGO in terms of constructing a string that is at small edit distance from (see the ETFS problem), we used the Edit Distance Relative Error, defined as
[TABLE]
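One way to compute such a relative error is sketched below. It assumes that the measure is the excess edit distance of the heuristic output over that of the exact output, divided by the latter, and that edit distance is the standard unit-cost (Levenshtein) distance. Both are assumptions made for this example, as are the names `w`, `x_heuristic`, and `x_exact`.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute) via row-wise DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # deleting the first i letters of a yields the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def relative_error(w, x_heuristic, x_exact):
    """(d(w, x_heuristic) - d(w, x_exact)) / d(w, x_exact).

    Assumption: this is how the relative error is normalized; a zero
    denominator is handled separately to avoid division by zero.
    """
    d_exact = edit_distance(w, x_exact)
    d_heur = edit_distance(w, x_heuristic)
    if d_exact == 0:
        return 0.0 if d_heur == 0 else float("inf")
    return (d_heur - d_exact) / d_exact


if __name__ == "__main__":
    # Arbitrary toy strings, not outputs of any algorithm in the paper.
    print(relative_error("abcd", "ab#cd#", "ab#d"))
```

Under this sketch, a value of 0 means that the heuristic output is as close to the original string as the exact one.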
All experiments ran on a Desktop PC with an Intel Xeon E5-2640 at 2.66GHz and 16GB RAM. Our source code is written in C++. The results presented below have been averaged over runs.
7.1 TPM vs. BA
Data Utility
We first demonstrate that TPM incurs very low distortion, which implies high utility for tasks based on the frequency of patterns (e.g., [29]). Fig. 2 shows that, for a varying number of sensitive patterns, TPM incurred on average (and up to ) times lower distortion than BA over all experiments. Also, Fig. 2 shows that TPM remains effective even in challenging settings, with many sensitive patterns (e.g., the last point in Fig. 2(b) where about of the positions in are sensitive). Fig. 3 shows that, for varying , TPM caused on average (and up to ) times lower distortion than BA over all experiments.
Next, we demonstrate that TPM permits accurate frequent pattern mining: Fig. 4 shows that TPM led to no -lost or -ghost patterns for the TRU and MSN datasets. This implies no utility loss for mining frequent length- substrings with threshold . In all other cases, the number of -ghosts was on average (and up to ) times smaller than the total number of -lost and -ghost patterns for BA. BA performed poorly (e.g., up to of frequent patterns became -lost for TRU and for DNA). Fig. 5 shows that, for varying , TPM led to on average (and up to ) times fewer -lost/ghost patterns than BA. BA performed poorly (e.g., up to of frequent patterns became -lost for DNA).
We also demonstrate that PFS-ALGO reduces the length of the output string of TFS-ALGO substantially, creating a string that contains less redundant information and allows for more efficient analysis. Fig. 6(a) shows the length of and of and their difference for . was much shorter than and its length decreased with the number of sensitive patterns, since more substrings had a suffix-prefix overlap of length and were removed (see Section 4). Interestingly, the length of was close to that of (the string before sanitization). A larger led to a less substantial length reduction, as shown in Fig. 6(b) (although a few thousand letters were still removed), since it is less likely for long substrings of sensitive patterns to have an overlap and be removed.
Efficiency
We finally measured the runtime of TPM using prefixes of the synthetic string SYN whose length is million letters. Fig. 6(c) (resp., Fig. 6(d)) shows that TPM scaled linearly with (resp., ), as predicted by our analysis in Section 5 (TPM takes time, since the algorithm of [28] was used for MCK instances). In addition, TPM is efficient, with a runtime similar to that of BA and less than seconds for SYN.
7.2 TM vs. TMI
We compare TM with TMI based on data utility and the number of implausible patterns incurred. The objective of these experiments is to show that TMI is able to produce a string that does not contain implausible patterns, while being comparable to TM in terms of the amount of distortion and number of ghost patterns incurred.
We do not report the results of comparing TM with TMI in terms of efficiency, because the runtime of TMI was almost identical to that of TM.
Impact of
We first demonstrate that many implausible patterns may occur as a result of replacing s with letters, when MCSR is used. This can be seen from Figs. 7(a), 7(b), and 7(c), which show the percentage of implausible patterns incurred by TM, for varying in OLD, TRU, and MSN, respectively. The percentage is on average (and up to ). The percentage for DNA is (omitted), because this dataset has a very small alphabet size. Thus, in this experiment, MCSR-ALGO and MCSRI-ALGO are essentially the same algorithm. Since TMI is guaranteed to eliminate implausible patterns, its corresponding percentages are zero (omitted).
We then demonstrate that TMI eliminates implausible patterns without incurring substantial utility loss compared to TM. Figs. 8 and 9 show that TMI incurred a comparable amount of distortion to TM. Specifically, TMI incurred and less distortion in the case of OLD and TRU datasets and more distortion in the case of MSN. TMI also incurred a similar number of ghosts to TM. Specifically, TMI incurred fewer ghosts in the case of TRU and more ghosts in the case of MSN. Note that no -ghost patterns were incurred in the case of OLD (for both TM and TMI). The worse performance of TMI in the case of the MSN dataset is attributed to its relatively small alphabet size, which makes it more difficult to select a letter replacement that does not incur implausible patterns.
Impact of
Fig. 10(a) shows that the percentage of implausible patterns incurred by TM for the OLD dataset was on average (and up to ). Again, this confirms the need to eliminate implausible patterns in practice. The results for TRU, MSN, and DNA are qualitatively similar and omitted from all remaining experiments.
We now demonstrate that TMI eliminates implausible patterns, while incurring a comparable amount of distortion and ghosts (on average) compared to TM. Specifically, the distortion for TMI was lower than TM on average (see Fig. 10(b)), and the number of -ghost patterns for TMI was lower on average (see Fig. 10(c)).
Impact of
We demonstrate that TMI can eliminate implausible patterns, while preserving data utility as well as TM does. This can be seen from Fig. 11(a), which shows that the percentage of implausible patterns incurred by TM was on average (and up to ), and from Figs. 11(b) and 11(c), which show that TMI caused on average lower distortion and fewer -ghosts, respectively, compared to TM.
7.3 TFS-ALGO vs. ETFS-ALGO
We demonstrate that TFS-ALGO is a very effective heuristic for the ETFS problem. Specifically, it constructs a string that is either an optimal solution to the problem or is at a slightly larger edit distance from compared to the exact solution string constructed by ETFS-ALGO. This can be seen from Fig. 12(a) (resp., 12(b)), which shows that TFS-ALGO constructed optimal solutions (i.e., the Edit Distance Relative Error was 0) in (resp., ) of the tested strings, on average. These strings are uniformly random and have the same length and alphabet as SYN. Qualitatively similar results were obtained for uniformly random strings of different lengths and alphabet sizes (omitted). In addition, the effectiveness of TFS-ALGO can be seen from Figs. 12(c) and 12(d), which show that the Edit Distance Relative Error in TRU was no more than . These results are encouraging because, unlike ETFS-ALGO, TFS-ALGO is applicable to large strings such as OLD, MSN, and DNA (recall that its time complexity is linear instead of quadratic in ).
8 Related Work
Data sanitization aims at concealing confidential information from a dataset prior to its dissemination. In privacy-preserving data mining, data sanitization (a.k.a. knowledge hiding) aims at concealing patterns modeling confidential knowledge by limiting their frequency, so that they are not easily mined from the data. Existing methods are applied to: (I) a collection of set-valued data (transactions) [35] or spatiotemporal data (trajectories) [1]; (II) a collection of sequences [16, 18]; or (III) a single sequence [6, 21, 36]. Yet, none of these methods follows our CSD setting: Methods in category I are not applicable to string data, and those in categories II and III do not have guarantees on privacy-related constraints [36] or on utility-related properties [16, 18, 6, 21]. Specifically, unlike our approach, [36] cannot guarantee that all sensitive patterns are concealed (constraint C1), while [16, 18, 6, 21] do not guarantee the satisfaction of utility properties (e.g., and P2).
Data anonymization is a different direction in privacy-preserving data mining which is applied to individual-specific data and aims to prevent the disclosure of individuals’ identity and/or information that individuals are not willing to be associated with [3, 25, 14]. On the other hand, our approach is applied to a string modeling information that does not necessarily refer to specific individuals and aims to protect sensitive patterns that model confidential knowledge rather than values individuals do not want to be associated with. For example, our approach may be applied to a string comprised of letters corresponding to orders of different products by a business. In this case, subsequences of ordered products that provide competitive advantage to the business [18] are treated as sensitive patterns and should be concealed from the disseminated string. The fact that anonymization methods deal with individual-specific data and aim to prevent privacy threats other than confidential knowledge exposure leads to fundamentally different protection principles and methods than ours. For instance, differential privacy [14] is a well-known anonymization principle and anonymization methods based on condensation [3] have been proposed for strings [3, 2]. Our work is related to anonymization approaches in that it shares the general objective of protecting string data with [3, 2] and that of protecting data while supporting string mining with the works of [7] and [10]. However, our work considers different input data and has a fundamentally different privacy objective than [3, 2, 7, 10]. Specifically, these works consider a collection of strings instead of a single long string and employ privacy objectives which do not aim to reduce the frequency of sensitive length- substrings to zero. Therefore, they cannot be applied to address the problems considered in this paper.
9 Conclusion
In this paper, we introduced the Combinatorial String Dissemination model. The focus of this model is on guaranteeing privacy-utility trade-offs in sequential data (e.g., C1 vs. and P2).
Under this model, we considered two different settings. The common privacy constraint in both settings is that the output string must not contain any sensitive pattern. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and the frequency of all non-sensitive patterns. We defined a problem, TFS, to capture these requirements, and a variant of it, PFS, that preserves a partial order and the frequency of the non-sensitive patterns but generally produces a shorter string. We developed two time-optimal algorithms, TFS-ALGO and PFS-ALGO, for TFS and PFS, respectively. We also developed MCSR-ALGO, a heuristic that is applied to the outputs of TFS-ALGO and PFS-ALGO to prevent the disclosure of the location of sensitive patterns, ensuring that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and the frequency of all non-sensitive patterns. We defined a problem, ETFS, to capture these requirements, and proposed ETFS-ALGO, an algorithm that constructs such a string by solving specific instances of approximate regular expression matching.
Our experiments show that string sanitization by TFS-ALGO, PFS-ALGO, and then MCSR-ALGO is both effective and efficient. They also demonstrate that TFS-ALGO can be employed as an effective heuristic for the ETFS problem, producing optimal or near-optimal solutions in practice.
- [1] Abul, O., Bonchi, F., Giannotti, F.: Hiding sequential and spatiotemporal patterns. TKDE 22(12), 1709–1723 (2010)
- [2] Aggarwal, C.C., Yu, P.S.: On anonymization of string data. In: SDM. pp. 419–424 (2007)
- [3] Aggarwal, C.C., Yu, P.S.: A framework for condensation-based anonymization of string data. DMKD 16(3), 251–275 (2008)
- [4] Almirantis, Y., Charalampopoulos, P., Gao, J., Iliopoulos, C.S., Mohamed, M., Pissis, S.P., Polychronopoulos, D.: On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology 12 (2017)
- [5] Bernardini, G., Chen, H., Conte, A., Grossi, R., Loukides, G., Pisanti, N., Pissis, S.P., Rosone, G.: String sanitization: A combinatorial approach. In: ECML/PKDD (2019), https://ecmlpkdd2019.org/downloads/paper/73.pdf
- [6] Bonomi, L., Fan, L., Jin, H.: An information-theoretic approach to individual sequential data sanitization. In: WSDM. pp. 337–346 (2016)
- [7] Bonomi, L., Xiong, L.: A two-phase algorithm for mining sequential patterns with differential privacy. In: CIKM. pp. 269–278 (2013)
- [8] Brendel, V., Beckmann, J.S., Trifonov, E.N.: Linguistics of nucleotide sequences: Morphology and comparison of vocabularies. Journal of Biomolecular Structure and Dynamics 4(1), 11–21 (1986)
