DiskJoin: Large-scale Vector Similarity Join with SSD
Yanqi Chen, Xiao Yan, Alexandra Meliou, Eric Lo

TL;DR
DiskJoin is a disk-based similarity join algorithm that processes billion-scale vector datasets on a single machine. It combines optimized disk I/O, caching, and pruning, and significantly outperforms existing methods.
Contribution
DiskJoin is the first disk-based similarity join algorithm tailored to large-scale vector data on a single machine. It combines optimized disk access, caching, and probabilistic pruning.
Findings
Achieves 50x to 1000x speedup over alternatives.
Effectively processes billion-scale datasets on a single machine.
Reduces disk I/O bottleneck through tailored data access patterns.
Abstract
Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and their efficiency suffers from expensive inter-machine communication. Disk-based solutions, on the other hand, are more cost-effective: they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs. In these methods, however, disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a…
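To make the definition in the abstract concrete, here is a minimal in-memory sketch of a similarity self-join: it returns every pair of vectors whose distance falls below a threshold. This is a brute-force baseline for illustration only, not DiskJoin's method. The function name `similarity_join`, the choice of Euclidean distance, and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def similarity_join(vectors, threshold):
    """Return index pairs (i, j), i < j, whose Euclidean distance is below threshold.

    Brute force: compares all n * (n - 1) / 2 pairs, so it only suits small inputs
    that fit in memory. Euclidean distance is an assumption of this sketch.
    """
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(vectors), 2)
        if dist(u, v) < threshold
    ]


# Hypothetical sample data: two tight clusters far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
pairs = similarity_join(points, threshold=0.5)
print(pairs)  # [(0, 1), (2, 3)]
```

The quadratic number of comparisons is what makes this approach impractical at billion scale, where the data no longer fits in memory and every comparison may imply a disk read.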
