Note on distance matrix hashing

I.A. Junussov

arXiv:1901.09505·cs.DS·January 30, 2019


TL;DR

This paper introduces a residual hashing algorithm for dynamical distance sets and presents a data structure that accelerates its computations, aiming to improve efficiency in distance matrix processing.

Contribution

It proposes a novel residual hashing function and an optimized data structure to enhance computation speed in distance matrix hashing.

Findings

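01 The hashing algorithm handles a dynamic set of distances that changes as clusters merge

02 The slot-based data structure accelerates look-up, insertion and deletion of distances

03 With the residual hash function, UPGMC runs considerably faster than the simple implementation that recomputes the distance matrix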


Equations

S_{0} \to S_{1} \to \dots \to S_{l - 1} .

j = (id_{m} + id_{s}) \bmod l .

(id_{x_{1}}, id_{y_{1}}, d_{x_{1} y_{1}}) \to (id_{x_{2}}, id_{y_{2}}, d_{x_{2} y_{2}}) \to \dots \to (id_{x_{p}}, id_{y_{p}}, d_{x_{p} y_{p}}) .

id_{x_{1}} \leq id_{x_{2}} \leq \dots \leq id_{x_{p}}



Full text

Abstract

A hashing algorithm for a dynamic set of distances is described. The proposed hash function is residual (the slot index is a residue modulo the number of slots). A data structure whose implementation accelerates computations is presented.

1 Introduction

The Unweighted Pair Group Method with Centroid distance minimization [1], or UPGMC, is one of the existing nonparametric clustering algorithms. It starts with $n$ points (interpreted as clusters) in some coordinate (Euclidean) space, and on each of $n-1$ steps it merges the closest pair of clusters into a cluster whose coordinates are the center of mass of all points belonging to the pair. Details may be found in [1].
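The merging loop described above can be sketched as follows. This is a minimal illustration only, assuming Euclidean distance between centroids; the function name `upgmc_naive` and the representation of a cluster as a pair (member indices, centroid) are assumptions made for this example and do not come from [1].

```python
from itertools import combinations
import math


def upgmc_naive(points):
    """Naive UPGMC sketch: recompute all centroid distances on every step.

    Returns the merges in order; each merge is a pair of clusters, and each
    cluster is a tuple of original point indices.
    """
    # Each cluster is (member indices, centroid coordinates).
    clusters = [((i,), tuple(p)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        # Recompute every pairwise distance and pick the closest pair (a < b).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: math.dist(clusters[ab[0]][1], clusters[ab[1]][1]),
        )
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        wa, wb = len(ma), len(mb)
        # Center of mass of all points belonging to the pair.
        centroid = tuple((wa * x + wb * y) / (wa + wb) for x, y in zip(ca, cb))
        merges.append((ma, mb))
        # Delete the larger index first so the smaller one stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append((ma + mb, centroid))
    return merges
```

For three collinear points `[(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]` the first merge joins points 0 and 1, and the second merge joins the result with point 2.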

A simple program implementation of this algorithm recomputes the distance matrix (the elements above or below the main diagonal), which shrinks on each step. This approach requires $O(n^{3})$ distance computations; more precisely, the required number is the tetrahedral number $T_{n-1}={n+1\choose 3}$ .
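The tetrahedral count can be checked by summing the number of pairs over all steps, under the assumption that a step with $k$ clusters recomputes all ${k\choose 2}$ pairwise distances, for $k=n,n-1,\ldots,2$:

```latex
\sum_{k=2}^{n}\binom{k}{2}=\binom{n+1}{3}=\frac{(n+1)\,n\,(n-1)}{6},
\qquad\text{e.g. } n=4:\quad 1+3+6=10=\binom{5}{3}.
```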

Another possible way is to update the set of current distances without repeating calculations already made. It requires computing ${n\choose 2}$ initial distances and $n-2+n-3+\ldots+1={n-1\choose 2}$ distances from each newly merged cluster to the remaining clusters. This sums up to ${n\choose 2}+{n-1\choose 2}=(n-1)^{2},$ i.e. $O(n^{2})$ distance computations are required.
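Both counts are easy to verify numerically. The snippet below is a small sanity check rather than part of the algorithm; the function names `naive_count` and `incremental_count` are chosen for this illustration.

```python
from math import comb


def naive_count(n):
    """Distances computed when all pairs are recomputed for k = n, ..., 2 clusters."""
    return sum(comb(k, 2) for k in range(2, n + 1))


def incremental_count(n):
    """Distances computed when only distances to the new cluster are added."""
    initial = comb(n, 2)
    # After a merge that leaves k clusters, the new cluster needs k - 1 distances.
    updates = sum(k - 1 for k in range(n - 1, 0, -1))
    return initial + updates


for n in range(2, 50):
    assert naive_count(n) == comb(n + 1, 3)
    assert incremental_count(n) == (n - 1) ** 2
```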

The problem is to design a data structure in which updating and deleting distances takes reasonable time. This paper describes hashing (partitioning) [2] of a dynamic set of distances instead of using a matrix data type.

2 Description of data structure

Let us enumerate the points by consecutive numbers $\mathrm{id}_{1},\mathrm{id}_{2},\ldots,\mathrm{id}_{k},k<\infty.$

Each element of the dynamic set of distances is a triple $(\mathrm{id}_{m},\mathrm{id}_{s},d_{ms})$ where $d_{ms}$ is the distance between the points indexed by $\mathrm{id}_{m},\mathrm{id}_{s}.$ It is assumed that in any triple the first element is less than the second element. The dynamic set of distances is implemented as a list $L$ of fixed length $l.$ Each element of $L$ is a dynamic list (slot) $S_{j},j=0,\ldots,l-1.$ Thus, $L$ is:

$$S_{0}\to S_{1}\to\dots\to S_{l-1}.$$

A triple $(\mathrm{id}_{m},\mathrm{id}_{s},d_{ms})$ belongs to slot $S_{j}$ if:

$$j=(\mathrm{id}_{m}+\mathrm{id}_{s}) \bmod l. \tag{1}$$

This is one of the possible hash functions on a set of ordered pairs of indexes. In other words, the pair $(\mathrm{id}_{m},\mathrm{id}_{s})$ uniquely defines $j$, the index of the slot $S_{j}\ni(\mathrm{id}_{m},\mathrm{id}_{s},d_{ms}).$ We use this partition of triples to accelerate look-up, insertion and deletion in $L$ .
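A minimal sketch of this partition, assuming the slot index is the residue $j=(\mathrm{id}_{m}+\mathrm{id}_{s}) \bmod l$ and that slots are plain Python lists. The names `slot_index` and `make_table`, and the sample distances, are illustrative assumptions.

```python
def slot_index(id_m, id_s, l):
    """Residual hash: residue of the index sum modulo the number of slots l."""
    return (id_m + id_s) % l


def make_table(l):
    """List L of fixed length l; every element is an initially empty slot."""
    return [[] for _ in range(l)]


L = make_table(5)
for id_m, id_s, d in [(1, 2, 0.5), (1, 3, 2.0), (2, 3, 1.5), (2, 4, 4.0)]:
    assert id_m < id_s  # the first element of a triple is the smaller index
    L[slot_index(id_m, id_s, len(L))].append((id_m, id_s, d))

# The pair alone determines the slot: (1, 2) -> 3, (1, 3) -> 4, (2, 3) -> 0, (2, 4) -> 1.
```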

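Consider slot $S_{j}$ for some $j=0,1,\ldots,l-1$ . Let it consist of the following triples:

$$(\mathrm{id}_{x_{1}},\mathrm{id}_{y_{1}},d_{x_{1}y_{1}})\to(\mathrm{id}_{x_{2}},\mathrm{id}_{y_{2}},d_{x_{2}y_{2}})\to\dots\to(\mathrm{id}_{x_{p}},\mathrm{id}_{y_{p}},d_{x_{p}y_{p}}).$$

A program implementation which creates and updates $S_{j}$ such that the first elements of the triples are sorted in the following way

$$\mathrm{id}_{x_{1}}\leq\mathrm{id}_{x_{2}}\leq\dots\leq\mathrm{id}_{x_{p}} \tag{2}$$

allows binary search to be used within a slot. For example, if the triple $(\mathrm{id}_{m},\mathrm{id}_{s},d_{ms})$ should be deleted: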

1. find the slot index by (1)

2. within the found slot, locate the first and last occurrences of triples with first element equal to $\mathrm{id}_{m}$ by binary search

3. search sequentially for $(\mathrm{id}_{m},\mathrm{id}_{s},d_{ms})$ within the located sublist
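The three deletion steps can be sketched as below. This is an illustration under stated assumptions: slots are Python lists of tuples kept nondecreasing in the first component, ids are integers, and the triple is identified by its pair of ids. The name `delete_triple` is hypothetical. The binary searches rely on Python's tuple ordering, where the one-element tuple `(id_m,)` compares less than any triple that starts with `id_m`.

```python
from bisect import bisect_left


def slot_index(id_m, id_s, l):
    """Slot index as the residue of the index sum modulo l."""
    return (id_m + id_s) % l


def delete_triple(L, id_m, id_s):
    """Delete the triple of pair (id_m, id_s); return its distance, or None."""
    # Step 1: find the slot by the hash function.
    slot = L[slot_index(id_m, id_s, len(L))]
    # Step 2: two binary searches bound the run of triples starting with id_m.
    lo = bisect_left(slot, (id_m,))
    hi = bisect_left(slot, (id_m + 1,))
    # Step 3: sequential search inside the located sublist.
    for pos in range(lo, hi):
        if slot[pos][1] == id_s:
            return slot.pop(pos)[2]
    return None
```

Returning the stored distance is a convenience of this sketch; the source only requires that the triple be removed.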

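Adding an element is similar, and the result should satisfy conditions (1), (2). The new triple can be inserted at the position of the first occurrence found in step 2, if there is one; otherwise it is inserted at the position that keeps the first elements sorted.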
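Insertion can be sketched in the same style. The code assumes integer ids and slots stored as Python lists of tuples; `insert_triple` is a hypothetical name. Inserting at the position returned by the binary search keeps the first components nondecreasing, whether or not a triple with the same first id is already present.

```python
from bisect import bisect_left


def slot_index(id_m, id_s, l):
    """Slot index as the residue of the index sum modulo l."""
    return (id_m + id_s) % l


def insert_triple(L, id_m, id_s, d):
    """Insert (id_m, id_s, d) so that the slot stays sorted by first component."""
    if id_m > id_s:
        id_m, id_s = id_s, id_m  # keep the smaller index first
    slot = L[slot_index(id_m, id_s, len(L))]
    # Position of the first triple starting with id_m, or where one would go.
    pos = bisect_left(slot, (id_m,))
    slot.insert(pos, (id_m, id_s, d))
```

Note that this variant leaves the second components unordered within a run of equal first ids, which is why deletion still needs a sequential search there.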

If creation and updating of the list additionally keep the triples with the same first component sorted in nondecreasing order of the second component within every slot, then binary search can also be used in step 3. Theoretically, this can improve performance for large $n.$
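With both components kept sorted, a single binary search locates a pair. The sketch below assumes slots are Python lists of tuples ordered lexicographically by (first id, second id), which `bisect.insort` maintains; the names `insert_sorted` and `find_distance` are illustrative.

```python
from bisect import bisect_left, insort


def slot_index(id_m, id_s, l):
    """Slot index as the residue of the index sum modulo l."""
    return (id_m + id_s) % l


def insert_sorted(L, id_m, id_s, d):
    """Insert the triple, keeping the slot sorted by first, then second id."""
    insort(L[slot_index(id_m, id_s, len(L))], (id_m, id_s, d))


def find_distance(L, id_m, id_s):
    """Return the stored distance of pair (id_m, id_s), or None if absent."""
    slot = L[slot_index(id_m, id_s, len(L))]
    # (id_m, id_s) sorts immediately before the triple with the same pair.
    pos = bisect_left(slot, (id_m, id_s))
    if pos < len(slot) and slot[pos][:2] == (id_m, id_s):
        return slot[pos][2]
    return None
```

The search itself becomes logarithmic in the slot size, although inserting into a Python list still shifts elements, so the practical gain depends on slot length.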

3 Conclusions

Multiple runs of UPGMC with the described hashing show a considerable decrease in overall runtime compared to the simple implementation mentioned in the introduction. The overall runtime depends on the number of slots $l.$ A suitable $l$ depends on $n$ and on the computational architecture. Parallelization of the hashing algorithm is possible.

Bibliography

[1] P.H.A. Sneath, R.R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman and Company, 1973.
[2] R. Graham, D. Knuth, O. Patashnik. Concrete Mathematics: A Foundation for Computer Science, Addison-Wesley, 1994.