DiskJoin: Large-scale Vector Similarity Join with SSD

Yanqi Chen; Xiao Yan; Alexandra Meliou; Eric Lo

arXiv:2508.18494 · cs.DB · October 13, 2025

TL;DR

DiskJoin is a disk-based similarity join algorithm that processes billion-scale vector datasets on a single machine. It optimizes disk I/O, caching, and pruning, and is reported to run 50x to 1000x faster than alternative methods.

Contribution

The paper presents DiskJoin as the first disk-based similarity join algorithm tailored to large-scale vector data on a single machine. It combines optimized disk access, caching, and probabilistic pruning.

Findings

01 Achieves a 50x to 1000x speedup over alternatives.

02 Processes billion-scale datasets on a single machine.

03 Reduces the disk I/O bottleneck through tailored data access patterns.

Abstract

Similarity join, a widely used operation in data science, finds all pairs of items whose distance is smaller than a threshold. Prior work has explored distributed computation methods to scale similarity join to large data volumes, but these methods require a cluster deployment, and their efficiency suffers from expensive inter-machine communication. Disk-based solutions, on the other hand, are more cost-effective because they use a single machine and store the large dataset on high-performance external storage, such as NVMe SSDs, but in these methods disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring the data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a…
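
To make the operation itself concrete, here is a minimal in-memory sketch of a similarity self-join as defined in the first sentence of the abstract. It is a brute-force baseline that compares every pair, not DiskJoin's algorithm, and it does no disk access, caching, or pruning. The function name `similarity_self_join`, the choice of Euclidean distance, and the sample points are assumptions made for illustration only.

```python
import math
from itertools import combinations


def similarity_self_join(vectors, threshold):
    """Return all index pairs (i, j), i < j, whose distance is below threshold.

    Assumption: Euclidean distance. The abstract only says "distance".
    This compares every pair, so it takes O(n^2) distance computations.
    """
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(vectors), 2)
        if math.dist(u, v) < threshold
    ]


# Hypothetical toy data: two tight clusters that are far from each other.
points = [(0.0, 0.0), (0.3, 0.4), (3.0, 4.0), (3.0, 4.5)]
pairs = similarity_self_join(points, threshold=1.0)
print(pairs)  # [(0, 1), (2, 3)]
```

The quadratic number of comparisons is what makes this approach impractical at billion scale, which is the setting the abstract targets.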
