Neural Locality Sensitive Hashing for Entity Blocking
Runhui Wang, Luyang Kong, Yefan Tao, Andrew Borthwick, Davor Golac, Henrik Johnson, Shadie Hijazi, Dong Deng, Yongfeng Zhang

TL;DR
This paper introduces NLSHBlock, a neural network-based locality-sensitive hashing method that uses pre-trained language models and a novel loss function to improve entity blocking in complex, real-world scenarios.
Contribution
It proposes neuralizing LSH: deep neural networks built on fine-tuned language models are trained to serve as the hashing functions, which addresses the difficulty of hand-designing traditional LSH functions for customized similarity metrics.
Findings
NLSHBlock outperforms existing LSH methods on real-world datasets.
The approach improves entity matching performance in semi-supervised settings.
Extensive evaluations demonstrate significant performance gains.
Abstract
Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely employed in large-scale data processing applications, such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited due to the need for careful design of hashing functions that align with specific metrics. Existing LSH-based Entity Blocking solutions primarily rely on generic similarity metrics such as Jaccard similarity, whereas practical use cases often demand complex and customized similarity rules surpassing the capabilities of generic similarity metrics. Consequently, designing LSH functions for these customized similarity rules presents considerable challenges. In this research, we propose a neuralization approach to enhance locality-sensitive hashing by training deep neural networks to serve as hashing functions for complex…
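For context, the generic Jaccard-based LSH blocking that the abstract contrasts against can be sketched with MinHash and banding. This is a minimal illustration of that classical baseline, not the paper's NLSHBlock method. All function names, the whitespace tokenizer, the sample records, and the `num_hashes` and `bands` values are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens of a record, as a set (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted by an integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(token_set, num_hashes=32):
    """Minimum hash per seed. Two sets agree at a given position with
    probability approximately equal to their Jaccard similarity."""
    return tuple(
        min((_hash(seed, t) for t in token_set), default=0)
        for seed in range(num_hashes)
    )


def lsh_blocks(records, num_hashes=32, bands=16):
    """Bucket record ids by (band index, band slice of the signature)."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(tokens(text), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    return buckets


def candidate_pairs(records, num_hashes=32, bands=16):
    """Pairs of record ids that share at least one bucket."""
    pairs = set()
    for ids in lsh_blocks(records, num_hashes, bands).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: "a" and "b" differ only in letter case.
records = {
    "a": "Apple iPhone 12 64GB black",
    "b": "apple iphone 12 64gb Black",
    "c": "Garden hose 50 ft green",
}
pairs = candidate_pairs(records)
```

Here `pairs` contains `("a", "b")`, because the two records reduce to the same token set and therefore share every band. The scheme only approximates Jaccard similarity over tokens, which is the limitation the abstract points to: a customized matching rule has no such ready-made hash family, which motivates learning the hashing function instead.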
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Malware Detection Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
