
TL;DR
This paper introduces a residual hashing algorithm for dynamical distance sets and presents a data structure that accelerates its computations, aiming to improve efficiency in distance matrix processing.
Contribution
It proposes a novel residual hashing function and an optimized data structure to enhance computation speed in distance matrix hashing.
Findings
Hashing algorithm effectively handles dynamical distance sets
Data structure significantly accelerates hashing computations
Repeated UPGMC runs with the described hashing show a considerable decrease in overall runtime compared with recomputing the distance matrix
Abstract
A hashing algorithm for a dynamic set of distances is described. The proposed hashing function is residual. A data structure whose implementation accelerates computations is presented.
1 Introduction
The Unweighted Pair Group Method with Centroid (UPGMC) [1] is one of the existing nonparametric clustering algorithms. It starts with points (interpreted as clusters) in a Euclidean coordinate space, and at each step it merges the closest pair of clusters into a single cluster located at the center of mass of all points belonging to the pair. Details may be found in [1].
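The merge loop described above can be sketched as follows. This is a minimal illustration, not the implementation from [1]: the function name `upgmc_naive`, the cluster representation (centroid plus point count), and the returned merge history are all assumptions made for this example.

```python
import math
from itertools import combinations

def upgmc_naive(points):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (id_a, id_b, distance) tuples.
    Cluster ids 0..n-1 are the input points; merged clusters get new ids.
    """
    # Hypothetical representation: id -> (centroid, number of points)
    clusters = {i: (tuple(p), 1) for i, p in enumerate(points)}
    next_id = len(points)
    history = []
    while len(clusters) > 1:
        # All pairwise distances are recomputed on every step (the naive approach).
        a, b = min(
            combinations(clusters, 2),
            key=lambda ab: math.dist(clusters[ab[0]][0], clusters[ab[1]][0]),
        )
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The merged cluster sits at the center of mass of all its points.
        merged = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters[next_id] = (merged, na + nb)
        history.append((a, b, math.dist(ca, cb)))
        next_id += 1
    return history
```

For three collinear points, `upgmc_naive([(0, 0), (1, 0), (10, 0)])` first merges points 0 and 1, then merges the resulting cluster with point 2.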
A simple implementation of this algorithm recomputes the distance matrix (the elements above or below the main diagonal), which shrinks at each step. For $n$ initial points this approach requires $O(n^3)$ distance computations; more precisely, the required number is the tetrahedral number $\frac{(n-1)n(n+1)}{6}$.
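The tetrahedral count can be checked directly. Assuming $n$ initial points (the symbol $n$ is notation chosen for this note), a state with $k$ clusters has $\binom{k}{2}$ pairwise distances, and the simple implementation recomputes them for $k = n, n-1, \ldots, 2$:

```latex
\sum_{k=2}^{n} \binom{k}{2} \;=\; \binom{n+1}{3} \;=\; \frac{(n-1)\,n\,(n+1)}{6} \;=\; O(n^{3})
```

For $n = 4$ this gives $6 + 3 + 1 = 10$ distance computations.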
Another option is to update the set of current distances without repeating calculations already made. For $n$ initial points this requires computing $\frac{n(n-1)}{2}$ initial distances and $\frac{(n-1)(n-2)}{2}$ distances between merged clusters and the remaining ones. This sums to $(n-1)^2$, i.e. $O(n^2)$ distance computations.
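The two operation counts can be compared with a short counting sketch. It assumes that after each merge the new cluster needs one distance to every other remaining cluster; the function names are hypothetical.

```python
def naive_count(n):
    """Distances computed when the whole matrix is rebuilt at every step."""
    # A state with k clusters has k*(k-1)/2 pairwise distances.
    return sum(k * (k - 1) // 2 for k in range(2, n + 1))

def incremental_count(n):
    """Initial distances plus distances from each merged cluster to the rest."""
    initial = n * (n - 1) // 2
    # Merging two of k clusters leaves k - 2 other clusters to measure against.
    after_merges = sum(k - 2 for k in range(2, n + 1))
    return initial + after_merges
```

Under these assumptions `naive_count(n)` equals $(n-1)n(n+1)/6$ and `incremental_count(n)` equals $(n-1)^2$, so the incremental approach saves roughly a factor of $n/6$ in distance computations.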
The problem is to design a data structure in which updating and deleting distances take reasonable time. This paper describes hashing (partitioning) [2] of the dynamic set of distances instead of using a matrix data type.
2 Description of data structure
Let us number the $n$ points with consecutive integers.
Each element of the dynamic set of distances is a triple $(i, j, d)$, where $d$ is the distance between the points indexed by $i$ and $j$. It is assumed that in any triple the first element is less than the second, $i < j$. The dynamic set of distances is implemented as a list $L$ of fixed length $m$. Each element of $L$ is a dynamic list (slot) $S_k$. Thus, $L$ is:
[TABLE]
A triple $(i, j, d)$ belongs to slot $S_k$ if:
[TABLE]
This is one possible hash function on the set of ordered pairs of indices. In other words, the pair $(i, j)$ uniquely determines $k$, the index of the slot $S_k$. We use this partition of triples to accelerate look-up, insertion and deletion in $L$.
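A minimal sketch of such a partition is given below. The hash formula (1) is not reproduced in this text, so `slot_index` uses `(i + j) % m` purely as a stand-in residue-based choice; the paper's actual function may differ. The names `slot_index` and `build_slots` are hypothetical.

```python
import math
from itertools import combinations

def slot_index(i, j, m):
    # ASSUMPTION: stand-in residue hash; formula (1) of the paper may differ.
    return (i + j) % m

def build_slots(points, m):
    """Partition all triples (i, j, d) with i < j into a list of m slots."""
    slots = [[] for _ in range(m)]
    # combinations() yields index pairs with i < j in lexicographic order,
    # so every slot ends up sorted by the first element of its triples.
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        slots[slot_index(i, j, m)].append((i, j, d))
    return slots
```

Any function that maps a pair of indices to a value in `range(m)` could replace `slot_index` here; the rest of the structure does not depend on the specific formula.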
Consider a slot $S_k$ for some $k$. Let it consist of the following triples:
[TABLE]
An implementation that creates and updates $L$ so that the first elements of the triples in each slot are sorted in the following way
[TABLE]
makes binary search within a slot possible. For example, to delete a triple $(i, j, d)$:
1. Find the slot index by (1).
2. Within the found slot, locate the first and last occurrences of triples whose first element equals $i$ by binary search.
3. Search sequentially for $j$ within the located sublist.
Adding an element is similar and must preserve conditions (1) and (2). The element can be inserted at the position of the first occurrence found in step 2, if there is one.
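The deletion and insertion steps can be sketched with Python's `bisect` module. This is an illustration under stated assumptions: triples are stored as tuples `(i, j, d)` with integer indices, each slot is kept sorted by first element, and `slot_index` is a stand-in for formula (1), which is not reproduced in this text. All function names are hypothetical.

```python
from bisect import bisect_left

def slot_index(i, j, m):
    # ASSUMPTION: stand-in residue hash; formula (1) of the paper may differ.
    return (i + j) % m

def find_run(slot, i):
    """Return [lo, hi): positions of the triples whose first element is i.

    Uses tuple ordering: the 1-tuple (i,) sorts before every (i, j, d),
    so only first elements are ever compared.
    """
    lo = bisect_left(slot, (i,))
    hi = bisect_left(slot, (i + 1,))
    return lo, hi

def delete(slots, i, j):
    """Remove and return the triple for pair (i, j), or None if absent."""
    slot = slots[slot_index(i, j, len(slots))]  # step 1: pick the slot
    lo, hi = find_run(slot, i)                  # step 2: binary search
    for pos in range(lo, hi):                   # step 3: sequential scan
        if slot[pos][1] == j:
            return slot.pop(pos)
    return None

def insert(slots, i, j, d):
    """Insert (i, j, d) while keeping the slot sorted by first element."""
    slot = slots[slot_index(i, j, len(slots))]
    lo, _ = find_run(slot, i)
    slot.insert(lo, (i, j, d))
```

If no triple with first element `i` is present, `lo` is still the position that keeps the slot ordered, so `insert` covers both cases.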
If creating and updating the list additionally keeps the triples that share a first component in nondecreasing order of their second component within every slot, then binary search can also be used in step 3. In theory, this can improve performance for large point sets.
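With this stricter ordering, a single binary search on the pair of indices is enough. The sketch below assumes triples are stored as tuples `(i, j, d)`, that each pair occurs at most once per slot, and that the slot is sorted by first and then second component; `find_pair` and `insert_sorted` are hypothetical names.

```python
from bisect import bisect_left

def find_pair(slot, i, j):
    """Position of the triple for pair (i, j) in a sorted slot, or -1."""
    # Under tuple ordering (i, j) sorts immediately before (i, j, d),
    # so the distance d never takes part in a comparison.
    pos = bisect_left(slot, (i, j))
    if pos < len(slot) and slot[pos][:2] == (i, j):
        return pos
    return -1

def insert_sorted(slot, i, j, d):
    """Insert (i, j, d) so the slot stays sorted by (i, j)."""
    slot.insert(bisect_left(slot, (i, j)), (i, j, d))
```

The look-up then costs a logarithmic number of comparisons in the slot length, while insertion and deletion in a Python list still shift elements in linear time.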
3 Conclusions
Multiple executions of UPGMC with the described hashing show a considerable decrease in overall runtime compared with the simple implementation mentioned in the introduction. The overall runtime depends on the number of slots $m$; a suitable $m$ depends on the number of points $n$ and on the computational architecture. Parallelization of the hashing algorithm is possible.
References
- [1] P.H.A. Sneath, R.R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman and Company, 1973.
- [2] R. Graham, D. Knuth, O. Patashnik. Concrete Mathematics: A Foundation for Computer Science, Addison-Wesley, 1994.
