Scalable high-dimensional indexing and searching with Hadoop
Denis Shestakov, Diana Moise

TL;DR
This paper presents a scalable Hadoop-based system for high-dimensional indexing and searching over massive multimedia datasets, demonstrating efficient performance on billions of descriptors.
Contribution
It introduces a scalable workflow for high-dimensional search on scientific grid environments using Hadoop, capable of handling over 30 billion descriptors.
Findings
Successfully indexed over 30 billion SIFT descriptors.
Achieved stable search throughput of around 210ms per image on 100 million images.
Demonstrated scalability and good search quality in large multimedia collections.
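The figures above can be put in perspective with some back-of-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 128-byte descriptor size is the conventional SIFT layout rather than a number from this summary, and the descriptors-per-image figure assumes the 30 billion descriptors come from the 100-million-image collection, which the summary does not state explicitly.

```python
# Back-of-envelope arithmetic on the reported figures.
# All helper names are hypothetical; nothing here is taken from the paper's code.

def images_per_second(ms_per_image: float) -> float:
    """Convert a per-image search time in milliseconds to images per second."""
    return 1000.0 / ms_per_image


def avg_descriptors_per_image(n_descriptors: float, n_images: float) -> float:
    """Average number of descriptors per image.

    ASSUMPTION: both counts refer to the same collection.
    """
    return n_descriptors / n_images


def raw_size_terabytes(n_descriptors: int, bytes_per_descriptor: int = 128) -> float:
    """Raw descriptor payload in decimal terabytes.

    ASSUMPTION: 128 bytes per descriptor (128 dimensions, one byte each),
    with no identifiers or index overhead included.
    """
    return n_descriptors * bytes_per_descriptor / 1e12


if __name__ == "__main__":
    print(f"{images_per_second(210.0):.2f} images/s at 210 ms per image")
    print(f"{avg_descriptors_per_image(30e9, 100e6):.0f} descriptors per image on average")
    print(f"{raw_size_terabytes(30_000_000_000):.2f} TB of raw descriptor data")
```

Under these assumptions, 210 ms per image corresponds to a little under five images per second, and the descriptor payload alone is on the order of a few terabytes, which suggests why a distributed framework such as Hadoop is a natural fit.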
Abstract
While high-dimensional search-by-similarity techniques have reached maturity and overall provide good performance, most of them are unable to cope with very large multimedia collections. The 'big data' challenge, however, has to be addressed, as multimedia collections have been growing explosively and will grow even faster within the next few years. Luckily, computational processing power has become more available to researchers due to easier access to distributed grid infrastructures. In this paper, we show how high-dimensional indexing and searching methods can be used on scientific grid environments and present a scalable workflow for indexing and searching over 30 billion SIFT descriptors using a cluster running Hadoop. Besides its scalability, the proposed scheme not only provides good search quality, but also achieves a stable throughput of around 210ms per image when…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Video Analysis and Summarization
