On a Subset Metric
Richard Castro, Zhibin Chang, Ethan Ha, Evan Hall, Hiren Maharaj

TL;DR
This paper introduces a new metric on finite subsets of a bounded metric space, extending previous subset distance concepts to facilitate error correction in DNA data storage and related applications.
Contribution
It generalizes existing subset metrics, providing a new mathematical framework for analyzing error correction in subset-based data representations.
Findings
Defines a new metric on finite subsets of a bounded metric space
Extends the sequence-subset distance used in DNA data storage
Builds on previous work by Eiter and Mannila on subset distance functions
Abstract
For a bounded metric space X, we define a metric on the set of all finite subsets of X. This generalizes the sequence-subset distance introduced by Wentu Song, Kui Cai and Kees A. Schouhamer Immink to study error correcting codes for DNA based data storage. This work also complements the work of Eiter and Mannila where they study extensions of distance functions to subsets of a space in the context of various applications.
Richard Castro, Zhibin Chang, Ethan Ha, Evan Hall, and Hiren Maharaj
Department of Mathematics and Statistics, San Diego State University, 5500 Campanile Drive, San Diego, 92182, CA, USA
Abstract.
For a bounded metric space $X$, we define a metric on the set of all finite subsets of $X$. This generalizes the sequence-subset distance introduced by Wentu Song, Kui Cai and Kees A. Schouhamer Immink [7] to study error correcting codes for DNA based data storage. This work also complements the work of Eiter and Mannila [3] where they study extensions of distance functions to subsets of a space in the context of various applications.
1. Introduction
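To design error correcting codes for DNA storage channels, a new metric, called the sequence-subset distance, was introduced in [7]. This metric generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors. The definition is as follows. Let $\mathbb{A}$ be a fixed finite alphabet and $L \geq 1$ an integer. For any $x_1, x_2 \in \mathbb{A}^L$, the Hamming distance $d_H(x_1, x_2)$ between $x_1$ and $x_2$ is the number of coordinates in which $x_1$ and $x_2$ differ. For two subsets $X_1, X_2 \subseteq \mathbb{A}^L$ with $|X_1| \leq |X_2|$, and any injection $\chi : X_1 \to X_2$, the $\chi$-distance between $X_1$ and $X_2$ is defined to be
$$d_\chi(X_1, X_2) = \sum_{x \in X_1} d_H(x, \chi(x)) + L\,(|X_2| - |X_1|). \tag{1}$$
The sequence-subset distance between $X_1$ and $X_2$ is defined to be
$$d_S(X_1, X_2) = \min_{\chi} d_\chi(X_1, X_2),$$
where the minimum is taken over all injections $\chi : X_1 \to X_2$.
In [7] it is shown that $d_S$ is in fact a metric on the set of subsets of $\mathbb{A}^L$.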
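The definition above can be sketched directly in code. The following is a minimal brute-force illustration, not the method of [7]: it enumerates every injection from the smaller set into the larger one, adds the Hamming distances of the matched pairs, and charges the sequence length for each unmatched sequence. The function names `hamming` and `sequence_subset_distance` are chosen here for illustration, and the enumeration is only practical for very small sets.

```python
from itertools import permutations


def hamming(u, v):
    """Number of coordinates in which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(u, v))


def sequence_subset_distance(s1, s2, length):
    """Minimum, over all injections from the smaller set into the larger,
    of the matched Hamming distances plus `length` per unmatched sequence."""
    small, large = sorted((list(s1), list(s2)), key=len)
    # Every unmatched sequence of the larger set costs the full length.
    penalty = length * (len(large) - len(small))
    # Each ordered selection from `large` is the image of one injection.
    best = min(
        sum(hamming(x, y) for x, y in zip(small, image))
        for image in permutations(large, len(small))
    )
    return best + penalty


print(sequence_subset_distance({"ACGT", "AAAA"}, {"ACGA", "AAAA", "TTTT"}, 4))
```

In this example the best injection sends `AAAA` to itself and `ACGT` to `ACGA` (one mismatch), and the unmatched `TTTT` contributes the length 4.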
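In this note we generalize the sequence-subset distance as follows. Let $(X, d)$ be a bounded metric space, and let $f : X \to \mathbb{R}$ be a function such that
$$d(x, y) \leq f(x) \leq d(x, y) + f(y) \tag{2}$$
for all $x, y \in X$. Put $\mathcal{P} = \mathcal{P}(X)$, the set of all finite subsets of $X$. For $X_1, X_2 \in \mathcal{P}$ with $|X_1| \leq |X_2|$, and any injection $\chi : X_1 \to X_2$, the $\chi$-distance between $X_1$ and $X_2$ is defined to be
$$d_\chi(X_1, X_2) = \sum_{x \in X_1} d(x, \chi(x)) + \sum_{y \in X_2 \setminus \chi(X_1)} f(y).$$
Now the distance between $X_1$ and $X_2$ is defined to be
$$d_S(X_1, X_2) = \min_{\chi} d_\chi(X_1, X_2), \tag{3}$$
where the minimum is taken over all injections $\chi : X_1 \to X_2$.
We show in Section 2 that $d_S$ is indeed a metric on $\mathcal{P}$. We will refer to this distance function simply as a subset metric.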
There is some flexibility in the choice of the function $f$. Since $X$ is a bounded metric space, we can select $f$ to have the constant value $\sup\{d(x, y) : x, y \in X\}$. In the case of the Hamming metric on $\mathbb{A}^L$, this is tantamount to choosing $f(x)$ to be the constant $L$ for all $x \in \mathbb{A}^L$, and the sequence-subset metric of [7] is recovered. In fact, $f$ could be any constant-valued function whose value is an upper bound for the metric $d$ on $X$. Alternatively, one could define $f$ as follows: for each $x \in X$, let
$$f(x) = \sup_{y \in X} d(x, y). \tag{4}$$
Condition (2) is satisfied: for all $x, y, z \in X$, $d(x, z) \leq d(x, y) + d(y, z) \leq d(x, y) + f(y)$, whence $f(x) \leq d(x, y) + f(y)$.
As for the sequence-subset distance of [7], the subset distance between and can be computed from a minimum weight perfect matching of the bipartite graph whose partite sets are and ; the edge joining with is assigned weight . The Kuhn-Munkres algorithm does this in time [4].
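The reduction to an assignment problem can be sketched as follows. This is one possible setup, stated as an assumption rather than as the construction used in [7]: the smaller set is padded with placeholder rows so that the cost matrix becomes square, and matching a placeholder with an element `y` of the larger set costs `f(y)`, the penalty for leaving `y` unmatched. To stay dependency-free, the sketch solves the assignment problem with an exponential-time dynamic program, which is only a stand-in for the Kuhn-Munkres algorithm and is suitable for small inputs only. The names `min_assignment_cost` and `subset_distance` are illustrative.

```python
from functools import lru_cache


def min_assignment_cost(cost):
    """Minimum-cost perfect matching for a square cost matrix.

    Exponential-time dynamic program over sets of used columns; a
    small-input stand-in for the Kuhn-Munkres algorithm.
    """
    n = len(cost)

    @lru_cache(maxsize=None)
    def solve(row, used):
        # `used` is a bitmask of the columns already matched to earlier rows.
        if row == n:
            return 0.0
        return min(
            cost[row][col] + solve(row + 1, used | (1 << col))
            for col in range(n)
            if not used & (1 << col)
        )

    return solve(0, 0)


def subset_distance(a, b, d, f):
    """Subset distance between finite sets a and b for a metric d and a
    penalty function f, computed through a padded assignment problem."""
    small, large = sorted((list(a), list(b)), key=len)
    # Real rows: cost of matching x in the smaller set with y in the larger.
    rows = [[d(x, y) for y in large] for x in small]
    # Placeholder rows: leaving y unmatched costs f(y).
    rows += [[f(y) for y in large] for _ in range(len(large) - len(small))]
    return min_assignment_cost(rows)


# Example on X = {0, ..., 10} with d(x, y) = |x - y| and constant f = 10.
print(subset_distance({1, 5}, {2, 6, 9}, lambda x, y: abs(x - y), lambda y: 10))
```

A perfect matching of the padded matrix corresponds to an injection of the smaller set into the larger one together with the penalties of the unmatched elements, so its minimum cost is the subset distance. For larger inputs, a polynomial-time assignment solver would replace `min_assignment_cost`.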
The generalized metric could potentially have more applications. For example, take $X$ to be the vertex set of a finite connected graph and $d(x, y)$ the length of the shortest path between $x$ and $y$. Then $d_S$ is a metric on the power set $\mathcal{P}(X)$ and provides a measure of distance between collections of vertices.
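The graph example can be made concrete with a short sketch. It assumes an unweighted connected graph given as an adjacency dictionary, computes shortest-path lengths by breadth-first search, and takes the penalty of a vertex to be its largest distance to any other vertex (its eccentricity), which is one admissible choice among those discussed above. The names `bfs_distances` and `vertex_set_distance` are illustrative, and the minimum over injections is found by brute force.

```python
from collections import deque
from itertools import permutations


def bfs_distances(adj, source):
    """Shortest-path lengths from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def vertex_set_distance(adj, a, b):
    """Subset distance between vertex sets a and b of a connected graph."""
    dist = {v: bfs_distances(adj, v) for v in adj}
    # Penalty of a vertex: its largest shortest-path distance (eccentricity).
    ecc = {v: max(dist[v].values()) for v in adj}
    small, large = sorted((list(a), list(b)), key=len)
    return min(
        sum(dist[x][y] for x, y in zip(small, image))
        + sum(ecc[y] for y in large if y not in image)
        for image in permutations(large, len(small))
    )


# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(vertex_set_distance(path, {0}, {1, 4}))
```

On this path graph, matching vertex 0 with vertex 1 costs 1 and leaves vertex 4 unmatched at penalty 4, which beats matching 0 with 4 (cost 4 plus penalty 3).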
Another example is image recognition. In this case take $X$ to be a bounded subset of the standard Euclidean plane $\mathbb{R}^2$ (for example, corresponding to a raster of pixels). For simplicity we take the unit square $X = [0, 1] \times [0, 1]$ as an example, and $d$ is the standard Euclidean distance. Each finite subset of $X$ would correspond to an image. Using (4) to define the function $f$, we have $f(x) = \max\{d(x, c_i) : 1 \leq i \leq 4\}$, where $c_1, c_2, c_3, c_4$ are the four corners of $X$. Alternatively, $f$ could be replaced by the constant function whose value is $\sqrt{2}$.
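The corner formula for the unit square can be checked numerically. The sketch below compares the largest distance from a point to the four corners with the largest distance to any point of a finite grid on the square; the grid is only an approximation of the supremum and is introduced here for illustration. The two agree because the distance to a fixed point is a convex function, so its maximum over the square is attained at a corner. The names `f_corners` and `f_grid` are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

CORNERS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]


def f_corners(p):
    """Largest Euclidean distance from p to a corner of the unit square."""
    return max(math.dist(p, c) for c in CORNERS)


def f_grid(p, n=50):
    """Largest distance from p to a point of an (n+1) x (n+1) grid on the
    unit square; the grid contains the four corners."""
    return max(
        math.dist(p, (i / n, j / n))
        for i in range(n + 1)
        for j in range(n + 1)
    )


for p in [(0.5, 0.5), (0.0, 0.0), (0.2, 0.9)]:
    print(p, f_corners(p), f_grid(p))
```

The value at the center is $\sqrt{2}/2$ and the value at a corner is $\sqrt{2}$, the constant upper bound mentioned above.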
Distance functions between subsets of a metric space, and also of measure spaces, have been widely studied; see [1] for a survey of such distances, and see also [2]. One of the most widely used subset metrics is the Hausdorff metric [1]. This metric has many variations, but we state one version for comparison. Let $X$ be a bounded metric space with metric $d$. For non-empty compact subsets $A, B$ of $X$, define
$$d_{\mathrm{Haus}}(A, B) = \max\Big\{ \sup_{a \in A} d(a, B),\ \sup_{b \in B} d(b, A) \Big\},$$
where $d(a, B) = \inf_{b \in B} d(a, b)$ and $d(b, A)$ is defined likewise. The function $d_{\mathrm{Haus}}$ gives a metric on the set of all compact subsets of $X$ that generalizes $d$: $d_{\mathrm{Haus}}(\{x\}, \{y\}) = d(x, y)$ for all $x, y \in X$. If $X$ is finite, the Hausdorff metric is computable in polynomial time. It also has theoretical benefits; for example, it is complete if $X$ is complete with respect to $d$. However, as pointed out in [3], it may not be appropriate for some applications, since the metric does not take into account the entire configuration of some finite sets. On the other hand, the sequence-subset metric formulated in [7] for the purpose of comparing DNA sequences provides a finer comparison between two collections of sequences and is thus a more appropriate distance measure in that situation. Each term involving $L$ on the right side of (1) expresses a natural worst case weight for a DNA strand that is too far away from the other set. While the authors of this work were primarily motivated by generalizing the work of [7], this work also complements that of [3], where the authors study extensions of distance measures to subsets more generally.

For comparison, we briefly recall some of the main results from [3]. A distance function on a non-empty set $X$ is one that satisfies all of the axioms of a metric, except possibly the triangle inequality. In [3], the authors consider the problem of extending a distance function $d$ to the set of non-empty finite subsets of $X$. They also discuss algorithms for computing such extensions. To measure a distance between two non-empty subsets $S_1, S_2$ of $X$, they discuss four distance functions: the sum of minimum distances [5]
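$$d_{md}(S_1, S_2) = \frac{1}{2} \Big( \sum_{e \in S_1} d(e, S_2) + \sum_{e \in S_2} d(e, S_1) \Big), \quad \text{where } d(e, S) = \min_{e' \in S} d(e, e'),$$
the surjective distance
$$d_{s}(S_1, S_2) = \min_{\eta} \sum_{(e_1, e_2) \in \eta} d(e_1, e_2),$$
where the minimum is over all surjections $\eta$ from the larger set to the smaller set (due to G. Oddie in [6]), the fair surjection distance
$$d_{fs}(S_1, S_2) = \min_{\eta} \sum_{(e_1, e_2) \in \eta} d(e_1, e_2),$$
where the minimum is over all fair surjections $\eta$ from the larger set to the smaller set (a surjection $\eta : S_1 \to S_2$ is called fair if $\big| |\eta^{-1}(x)| - |\eta^{-1}(y)| \big| \leq 1$ for all $x, y \in S_2$; this is also due to G. Oddie in [6]), and they introduce the link distance
$$d_{l}(S_1, S_2) = \min_{R} \sum_{(e_1, e_2) \in R} d(e_1, e_2),$$
where the minimum is over all linking relations $R$ between $S_1$ and $S_2$ (a subset $R \subseteq S_1 \times S_2$ is called a linking relation if for all $e_1 \in S_1$ there exists $e_2 \in S_2$ such that $(e_1, e_2) \in R$, and also for all $e_2 \in S_2$ there exists $e_1 \in S_1$ such that $(e_1, e_2) \in R$). While they show that these distance functions fail to be a metric in the case that $X$ is a finite subset of the integral plane and $d$ is the Manhattan metric, Eiter and Mannila present an elegant construction, called the metric infimum method, that produces a metric from a given distance function. Interestingly, they demonstrate that . The authors in [3] argue that the link metric is very intuitive in some contexts. It would be interesting to also study this metric in the context of error correcting codes for DNA data storage.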
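The remark that the Hausdorff metric does not see the entire configuration of a finite set can be illustrated with a small sketch. It computes the Hausdorff distance for finite sets directly from the definition and compares it with a brute-force subset distance. The point sets, the metric $|x - y|$ on $\{0, \dots, 10\}$, and the constant penalty 10 are assumptions chosen for this example, and the function names are illustrative.

```python
from itertools import permutations


def hausdorff(a, b, d):
    """Hausdorff distance between two non-empty finite sets."""
    forward = max(min(d(x, y) for y in b) for x in a)
    backward = max(min(d(x, y) for x in a) for y in b)
    return max(forward, backward)


def subset_distance(a, b, d, f):
    """Brute-force subset distance: minimum over injections of the matched
    distances plus the penalties f(y) of the unmatched elements."""
    small, large = sorted((list(a), list(b)), key=len)
    return min(
        sum(d(x, y) for x, y in zip(small, image))
        + sum(f(y) for y in large if y not in image)
        for image in permutations(large, len(small))
    )


d = lambda x, y: abs(x - y)
f = lambda y: 10  # constant upper bound for d on {0, ..., 10}

A, B, C = {0, 10}, {0, 5, 10}, {0, 4, 5, 6, 10}
print(hausdorff(A, B, d), hausdorff(A, C, d))
print(subset_distance(A, B, d, f), subset_distance(A, C, d, f))
```

Here the Hausdorff distance from `A` is 5 for both `B` and `C`, because only the worst single point matters, while the subset distance grows from 10 to 30 as the number of extra points grows from one to three.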
The rest of the paper is devoted to proving that (3) is indeed a metric.
2. Proofs
Throughout this section $X$ is a bounded metric space with metric $d$, the function $f$ is one that satisfies the condition (2), $d_S$ is the function defined by (3) and $\mathcal{P}$ is the set of all finite subsets of $X$. In this section we prove that the function $d_S$ is a metric on $\mathcal{P}$. While the main steps followed here are inspired by [7], there are differences to account for the presence of the function $f$ in the definition of $d_S$.
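Lemma 1.
For any $X_1, X_2 \in \mathcal{P}$ such that $|X_1| \leq |X_2|$, there exists an injection $\chi_0 : X_1 \to X_2$ such that $d_S(X_1, X_2) = d_{\chi_0}(X_1, X_2)$ and $\chi_0(x) = x$ for all $x \in X_1 \cap X_2$.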
Proof.
If , then the statement is vacuously true. Suppose that . Choose such that . The proof will be in two parts. First we show that, if necessary, can be redefined on so that and is contained in the image of . Next we will show that can be further adjusted to have the desired properties.
Suppose that some does not belong to the image of . Then we redefine at to form a new embedding by setting
[TABLE]
By definition . Note that and
[TABLE]
Since , and , it follows that
[TABLE]
Combining (5) and (6), we get that
[TABLE]
From the condition (2), it follows that . Thus and . By repeatedly applying the above procedure we will obtain an embedding of into , which we also call , with the property that .
Let . Next we show that if then we can adjust the embedding to form a new embedding such that we have and still have that . From above we know that there exists such that . Put and define
[TABLE]
Then is an injection and, by the definition of the subset distance, . Also we have that
[TABLE]
where the last inequality follows from the triangle inequality. Thus and we see that and . By repeated application of the above procedure, we obtain an embedding with the desired property. ∎
Corollary 1.
For any ,
[TABLE]
Proof.
This is a direct consequence of Lemma 1 and the definition of . ∎
Lemma 2.
Suppose that with . Then for any , .
Proof.
Suppose such that . If , then . If , then for some and . Fix and define by
[TABLE]
Then so is the disjoint union and
[TABLE]
since by condition (2). ∎
By repeated application of the above result, we obtain the following corollary.
Corollary 2.
For any , such that . Suppose that such that . Then
[TABLE]
Theorem 1.
$d_S$ is a metric on $\mathcal{P}$.
Proof.
For two finite sets and we denote by the set of injections . Let . By definition of we have that . We show that iff . We may assume that , and let be such that . Then iff iff for all iff for all iff .
Thus, we need only to show that satisfies the Triangle Inequality. Let . We will show that by considering various cases. Note that we are still assuming that , and that is such that .
Case 1: Suppose that . Let and , be such that and . We may assume that
[TABLE]
where and for and for . Then
[TABLE]
Let . Then
[TABLE]
by condition (2)
Case 2: Suppose . Let and be such that and . We may assume that
[TABLE]
where and for and for . Then
[TABLE]
Define by for . Then
[TABLE]
where the last inequality follows from by condition (2).
Case 3: Suppose .
Fix a subset of of cardinality equal to . Then from Case 1, it follows that . From Corollary 2 we know that and . Thus .
∎
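As a sanity check on the statement just proved, the metric axioms can be tested exhaustively on a tiny example. The sketch below is an illustration under stated assumptions and not part of the proof: the space is $\{0, 1, 2, 3\}$ with $d(x, y) = |x - y|$, the penalty is the largest distance from a point, $f(x) = \max(x, 3 - x)$, and the subset distance is taken to be the minimum over injections of the matched distances plus the penalties of the unmatched elements. The names `subset_distance` and `violations` are illustrative.

```python
from itertools import combinations, permutations


def subset_distance(a, b, d, f):
    """Brute-force minimum over injections from the smaller set to the larger."""
    small, large = sorted((tuple(a), tuple(b)), key=len)
    return min(
        sum(d(x, y) for x, y in zip(small, image))
        + sum(f(y) for y in set(large) - set(image))
        for image in permutations(large, len(small))
    )


def violations(points, d, f):
    """List every pair or triple of subsets that breaks a metric axiom."""
    subsets = [
        frozenset(c)
        for r in range(len(points) + 1)
        for c in combinations(points, r)
    ]
    dist = {(a, b): subset_distance(a, b, d, f) for a in subsets for b in subsets}
    bad = []
    for a in subsets:
        for b in subsets:
            # Zero distance exactly for equal sets, and symmetry.
            if (dist[a, b] == 0) != (a == b) or dist[a, b] != dist[b, a]:
                bad.append((a, b))
            # Triangle inequality through every third subset.
            for c in subsets:
                if dist[a, b] > dist[a, c] + dist[c, b]:
                    bad.append((a, b, c))
    return bad


points = range(4)
d = lambda x, y: abs(x - y)
print(len(violations(points, d, lambda x: max(x, 3 - x))))  # admissible penalty
print(len(violations(points, d, lambda x: 0)))              # penalty that is too small
```

With the admissible penalty no violation is reported, while the zero penalty immediately breaks the axioms, for instance because the empty set and a singleton are then at distance zero.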
Remark 1.
If $X$ contains at least two elements, then the function $f$ never takes on the value 0. In fact, there exists a constant such that for all : from (2), . Thus for all . Put . If , then the inequality implies that for all , contradicting that $X$ contains at least two elements. Thus is the required constant.
Remark 2.
If is a Cauchy sequence in , it can be shown that for all sufficiently large: let be as in Remark 1. Then there exists such that for all . Since for all , it follows that for all .
Remark 3.
If the topology induced by the metric $d$ on $X$ is the discrete topology, then $\mathcal{P}$ is complete with respect to the subset metric. However, this is not the case in general. Consider the case where , is the usual Euclidean metric and . Put for all . Then is a Cauchy sequence that does not converge: if did converge, using Lemma 1 and Remark 2, it would converge to a set of the form for some . But as , so must be equal to [math]. But if , then as , a contradiction.
[1] Aura Conci and Carlos Kubrusly. Distances between sets - a survey. Adv. Math. Sci. Appl., 26(1):1–18, 2017.
[2] Michel Marie Deza and Elena Deza. Encyclopedia of distances. Springer-Verlag, Berlin, 2009. With 1 CD-ROM (Windows, Macintosh and UNIX).
[3] Thomas Eiter and Heikki Mannila. Distance measures for point sets and their computation. Acta Inform., 34(2):109–133, 1997.
[4] James Munkres. Algorithms for the assignment and transportation problems. J. Soc. Indust. Appl. Math., 5:32–38, 1957.
[5] Ilkka Niiniluoto. Truthlikeness, volume 185 of Synthese Library. Springer Dordrecht, 1987.
[6] Ilkka Niiniluoto and Raimo Tuomela, editors. The logic and epistemology of scientific change. Societas Philosophica Fennica, Helsinki, 1979. Acta Philos. Fenn. 30 (1978), no. 2-4 (1979).
[7] Wentu Song, Kui Cai, and Kees A. Schouhamer Immink. Sequence-subset distance and coding for error control in DNA-based data storage. IEEE Trans. Inform. Theory, 66(10):6048–6065, 2020.
