Vector and Line Quantization for Billion-scale Similarity Search on GPUs

Wei Chen; Jincai Chen; Fuhao Zou; Yuan-Fang Li; Ping Lu; Qiang Wang,; Wei Zhao

arXiv:1901.00275·cs.CV·July 30, 2019

Vector and Line Quantization for Billion-scale Similarity Search on GPUs

Wei Chen, Jincai Chen, Fuhao Zou, Yuan-Fang Li, Ping Lu, Qiang Wang,, Wei Zhao

PDF

1 Repo

TL;DR

This paper introduces VLQ-ADC, a novel vector and line quantization-based hierarchical inverted index for billion-scale approximate nearest neighbor search on GPUs, achieving higher accuracy and speed with comparable memory use.

Contribution

The paper proposes a new hierarchical inverted index structure using vector and line quantization, improving search efficiency and accuracy for billion-scale datasets on GPUs.

Findings

01

VLQ-ADC outperforms state-of-the-art systems in accuracy and speed.

02

The method maintains comparable memory consumption to existing approaches.

03

Extensive evaluation on SIFT1B and DEEP1B datasets validates effectiveness.

Abstract

Billion-scale high-dimensional approximate nearest neighbour (ANN) search has become an important problem for searching similar objects among the vast amount of images and videos available online. The existing ANN methods are usually characterized by their specific indexing structures, including the inverted index and the inverted multi-index structure. The inverted index structure is amenable to GPU-based implementations, and the state-of-the-art systems such as Faiss are able to exploit the massive parallelism offered by GPUs. However, the inverted index requires high memory overhead to index the dataset effectively. The inverted multi-index structure is difficult to implement for GPUs, and also ineffective in dealing with database with different data distributions. In this paper we propose a novel hierarchical inverted index structure generated by vector and line quantization…

Tables4

Table 1. Table 1: Commonly used notations.

Notation	Description
$x_{i}$ , $D$	data points, their dimension and the number of data points
$𝒳$ , $N$	a set of data points and its size, $𝒳 = {x_{1}, \dots, x_{N}} \subset ℝ^{D}$
$c, s, l (c, s)$	centroids, nodes and edges
$m$	encoding length
$k$	the number of first-level centroids
$n$	the number of edges of each first-level centroid
$w_{1}$	the number of first-layer nearest regions for a query
$α$	the portion of the nearest of the $w \cdot n$ second-level regions
$w_{2}$	the number of second-level nearest regions for a query, $w_{2} = α \cdot w_{1} \cdot n$
$λ$	a scalar parameter for line quantization
$r$	displacement from data points to the approximate points

Table 2. Table 2: A summary of current state-of-the-art retrieval systems based on quantization method. N 𝑁 N is the size of the dataset 𝒳 𝒳 \mathcal{X} , m 𝑚 m is the number of sub-vectors in product quantizatino (PQ), k 𝑘 k is the size of the codebook, and n 𝑛 n is the number of second-level regions. In the last column of each row, the first term is the complexity for encoding, and the second term is the complexity for indexing.

System	Index structure	Encoding	CPU/GPU	Space complexity
Faiss [6]	VQ	PQ	GPU	$𝒪 (N \cdot m) + 𝒪 (k \cdot D)$
Ivf-hnsw [7]	2-level VQ	PQ	CPU	$𝒪 (N \cdot m) + 𝒪 (k \cdot (D + n))$
Multi-D-ADC [16]	IMI (PQ)	PQ	CPU	$𝒪 (N \cdot m) + 𝒪 (k \cdot (D + k))$
VLQ-ADC (our system)	VLQ	PQ	GPU	$𝒪 (N \cdot m) + 𝒪 (k \cdot (D + n))$

Table 3. Table 3: Performance comparison between VLQ-ADC (with the re-ranking step) against three other state-of-the-art retrieval systems of recall@1/10/100 and retrieval time on two public datasets. For each system the number of total regions is specified beneath each system’s name. VLQ-ADC consistently achieves higher recall values and significantly lower query time than all other systems. Best result in each column is bolded , and second best is underlined . For the two GPU-based systems, Faiss and VLQ-ADC, we experiments are performed on 1 GPU for 8-byte encoding length, and on 2 GPUs for 16-byte encoding length.

System	SIFT1B								DEEP1B
	8 bytes				16 bytes				8 bytes				16 bytes
	R@1	R@10	R@100	t (ms)	R@1	R@10	R@100	t (ms)	R@1	R@10	R@100	t (ms)	R@1	R@10	R@100	t (ms)
Faiss $2^{18}$	0.1383	0.4432	0.7978	0.31	0.3180	0.7825	0.9618	0.280	0.2101	0.4675	0.7438	0.32	0.3793	0.7650	0.9509	0.33
Ivf-hnsw $2^{16}$	0.1599	0.496	0.778	2.35	0.331	0.737	0.8367	2.77	0.217	0.467	0.7195	2.30	0.3646	0.7096	0.828	3.07
Multi-D-ADC $2^{10} \times 2^{10}$	0.1255	0.4191	0.7843	1.65	0.3064	0.7716	0.9782	8.54	0.1716	0.3834	0.6527	3.28	0.324	0.6918	0.9258	6.152
Multi-D-ADC $2^{12} \times 2^{12}$	0.1420	0.4720	0.8183	0.367	0.3324	0.8029	0.9752	1.603	0.1874	0.4240	0.6979	0.839	0.3557	0.7087	0.9059	1.52
VLQ-ADC $2^{16}$	0.1620	0.507	0.829	0.054	0.345	0.8033	0.9400	0.068	0.2227	0.4855	0.7559	0.059	0.394	0.7644	0.9272	0.067

Table 4. Table 4: The memory consumption of all systems for SIFT1B of 10 9 superscript 10 9 10^{9} 128-dimensional data points.

System (codebook size)	Memory consumption (GB)
Faiss ( $2^{18}$ )	14
Ivf-hnsw ( $2^{16}$ )	13.04
Multi-D-ADC ( $2^{12} \times 2^{12}$ )	12.25
VLQ-ADC ( $2^{16}$ )	13.55

Equations37

E = x \in X \sum ∥ x - q_{v} (x) ∥^{2} .

E = x \in X \sum ∥ x - q_{v} (x) ∥^{2} .

q_{v} (x) = ar g c \in C min ∥ x - c ∥ .

q_{v} (x) = ar g c \in C min ∥ x - c ∥ .

C = C^{1} \times \dots \times C^{m} .

C = C^{1} \times \dots \times C^{m} .

q_{p} (x) = (q_{p}^{1} (x^{1}), \dots, q_{p}^{m} (x^{m})),

q_{p} (x) = (q_{p}^{1} (x^{1}), \dots, q_{p}^{m} (x^{m})),

∥ y - x ∥^{2} \approx∥ y - q_{p} (x) ∥^{2} = i = 1 \sum m ∥ y^{i} - c_{j_{i}}^{i} ∥^{2}

∥ y - x ∥^{2} \approx∥ y - q_{p} (x) ∥^{2} = i = 1 \sum m ∥ y^{i} - c_{j_{i}}^{i} ∥^{2}

q_{l} (x) = ar g l (c_{i}, c_{j}) min d (x, l (c_{i}, c_{j})),

q_{l} (x) = ar g l (c_{i}, c_{j}) min d (x, l (c_{i}, c_{j})),

d (x, l (c_{i}, c_{j}))^{2}

d (x, l (c_{i}, c_{j}))^{2}

+ λ ∥ x - c_{j} ∥^{2}

λ = 0.5 \cdot \frac{( ∥ x - c _{i} ∥ ^{2} + ∥ c _{j} - c _{i} ∥ ^{2} - ∥ x - c _{j} ∥ ^{2} )}{∥ c _{j} - c _{i} ∥ ^{2}} .

λ = 0.5 \cdot \frac{( ∥ x - c _{i} ∥ ^{2} + ∥ c _{j} - c _{i} ∥ ^{2} - ∥ x - c _{j} ∥ ^{2} )}{∥ c _{j} - c _{i} ∥ ^{2}} .

r_{q_{l}} (x) = x - ((1 - λ) \cdot c_{i} + λ \cdot c_{j}) .

r_{q_{l}} (x) = x - ((1 - λ) \cdot c_{i} + λ \cdot c_{j}) .

r_{q} (x) = x - q (x),

r_{q} (x) = x - q (x),

X_{j}^{i} = {x \in X^{i} ∣ q_{l} (x) = l (c_{i}, s_{ij})}, for all i \in {1 \dots k}

X_{j}^{i} = {x \in X^{i} ∣ q_{l} (x) = l (c_{i}, s_{ij})}, for all i \in {1 \dots k}

r_{q_{l}} (x)

r_{q_{l}} (x)

λ_{ij}

∥ y - q_{1} (x) - q_{2} (x - q_{1} (x)) ∥^{2}

∥ y - q_{1} (x) - q_{2} (x - q_{1} (x)) ∥^{2}

∥ y - q_{1} (x)

∥ y - q_{1} (x)

2 ⟨ y, q_{2} (\dots)⟩ .

term1 ∥ y - ((1 - λ_{ij}) c_{i} + λ_{ij} s_{ij}) ∥^{2} + term2 ∥ q_{2} (\dots) ∥^{2} +

term1 ∥ y - ((1 - λ_{ij}) c_{i} + λ_{ij} s_{ij}) ∥^{2} + term2 ∥ q_{2} (\dots) ∥^{2} +

2 (1 - λ_{ij}) term3 ⟨ c_{i}, q_{2} (\dots)⟩ + 2 λ_{ij} term4 ⟨ s_{ij}, q_{2} (\dots)⟩ - term5 2 ⟨ y, q_{2} (\dots)⟩ .

∥ y -

∥ y -

(λ_{ij}^{2} - λ_{ij}) term7 ∥ c_{i} - s_{ij} ∥^{2} + λ_{ij} term8 ∥ y - s_{ij} ∥^{2} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

zjuchenwei/vector-line-quantization
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Vector and Line Quantization for Billion-scale Similarity Search on GPUs

Wei Chen

Jincai Chen

[email protected]

Fuhao Zou

[email protected]

Yuan-Fang Li

Ping Lu

Qiang Wang

Wei Zhao

Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China

Key Laboratory of Information Storage System of Ministry of Education, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074,China

Faculty of Information Technology, Monash University, Clayton 3800, Australia

Abstract

Billion-scale high-dimensional approximate nearest neighbour (ANN) search has become an important problem for searching similar objects among the vast amount of images and videos available online. The existing ANN methods are usually characterized by their specific indexing structures, including the inverted index and the inverted multi-index structure. The inverted index structure is amenable to GPU-based implementations, and the state-of-the-art systems such as Faiss are able to exploit the massive parallelism offered by GPUs. However, the inverted index requires high memory overhead to index the dataset effectively. The inverted multi-index structure is difficult to implement for GPUs, and also ineffective in dealing with database with different data distributions. In this paper we propose a novel hierarchical inverted index structure generated by vector and line quantization methods. Our quantization method improves both search efficiency and accuracy, while maintaining comparable memory consumption. This is achieved by reducing search space and increasing the number of indexed regions.

We introduce a new ANN search system, VLQ-ADC, that is based on the proposed inverted index, and perform extensive evaluation on two public billion-scale benchmark datasets SIFT1B and DEEP1B. Our evaluation shows that VLQ-ADC significantly outperforms the state-of-the-art GPU- and CPU-based systems in terms of both accuracy and search speed. The source code of VLQ-ADC is available at https://github.com/zjuchenwei/vector-line-quantization.

keywords:

Quantization; Billion-scale similarity search; high dimensional data; Inverted index; GPU

††journal: Future Generation Computer Systems

1 Introduction

In the age of the Internet, the amount of images and videos available online increases incredibly fast and has grown to an unprecedented scale. Google processes over 40,000 various queries per second, and handles more than 400 hours of YouTube video uploads every minute [1]. Every day, more than 100 million photos/videos are uploaded to Instagram, more than 300 million uploaded to Facebook, and a total of 50 billion photos have been shared to Instagram111https://www.omnicoreagency.com/instagram-statistics/. As a result, scalable and efficient search for similar images and videos on the billion scale has become an important problem and it has been under intense investigation.

As online images and videos are unstructured and usually unlabeled, it is hard to compare them directly. A feasible solution is to use real-valued, high-dimensional vectors to represent images and videos, and compare the distances between the vectors to find the nearest ones. Due to the curse of dimensionality [2], it is impractical for multimedia applications to perform exhaustive search in billion-scale datasets. Thus, as an alternative, many approximate nearest neighbor (ANN) search algorithms are now employed to tackle the billion-scale search problem for high-dimensional data. Recent best-performing billion-scale retrieval systems [3, 4, 5, 6, 7, 8] typically utilize two main processes: indexing and encoding.

To avoid expensive exhaustive search, these systems use index structures that can partition the dataset space into a large number of disjoint regions, and the search process only collects points from the regions that are closest to the query point. The collected points then form a short list of candidates for each query point. The retrieval system then calculates the distance between each candidate and the query point, and sort them accordingly.

To guarantee query speed, the indexed points need to be loaded into RAM. For large datasets that do not fit in RAM, dataset points are encoded into a compressed representation. Encoding has also proven to be critical for memory-limited devices such as GPUs that excel at handling data-parallel tasks. A high-performance CPU like Intel Xeon Platinum 8180 (2.5 GHz, 28 cores) performs 1.12 TFLOP/s single precision peak performance222https://ark.intel.com/content/www/us/en/ark/products/120496/intel-xeon-platinum-8180-processor-38-5m-cache-2-50-ghz.html. In contrast, GPUs like NVdia Tesla P100 can provide up to 10T FLOP/s single precision peak performance333https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf, and are good choices for high performance similarity search systems. Many encoding methods have been proposed, including hashing methods and quantization methods. Hashing methods encode data points to compact binary codes through a hash function [9, 10], and quantization methods, typically product quantization (PQ), map data points to a set of centroids and use the indices of the centroids to encode the data points [11, 12]. By hashing methods, the distance between two data points can be approximated by the Hamming distance between their binary code. By quantization methods, the Euclidean distance between the query and compressed points can be computed efficiently. It has been shown in the literature that quantization encoding can be more accurate than various hashing methods [11, 13, 14].

Jégou et al. [11] first introduced an index structure that is able to handle billion-scale datasets effieciently. It is based on the inverted index structure that partitions the high dimensional vector space into Voronoi regions for a set of centroids obtained by a quantization method called vector quantization (VQ) [15]. This system, called IVFADC, achieves reasonable recall rates in several tens of milliseconds. However, the VQ-based index structure needs to store a large set of full dimensional centroids to produce a huge number of regions, which would require a large amount of memory.

An improved inverted index structure called the inverted multi-index (IMI) was later proposed by Babenko and Lempitsky [16]. The IMI is based on product quantization (PQ), which divides the point space into several orthogonal subspaces and clusters the subspaces into Voronoi regions independently. The Cartesian product of regions in each subspace forms regions in the global point space. The strength of the IMI is that it can produce a huge number of regions with much smaller codebooks than that of the inverted index. Due to the huge number of indexed regions, the point space is finely partitioned and each regions contains fewer points. Hence the IMI can provide accurate and concise candidate lists with memory and runtime efficiency.

However, it has been observed that for some billion-scale datasets, the majority of the IMI regions contain no points [5], which is a waste of index space and has a negative impact on the final retrieval performance. The reasons for this deficiency is that the IMI learns the centroids independently on the subspaces which are not statistically independent [7]. In fact, some convolutional neural networks (CNN) produce feature vectors with considerable correlations between the subspaces [10, 17, 18].

The high level of parallelism provided by GPUs has recently been leveraged to accelerate similarity search of high-dimensional data, and it has been demonstrated that GPU-based systems are more efficient than CPU-based systems by a large margin [6, 4]. Comparing to IMI structure, the inverted indexing structure proposed by Jégou et al. [11] is more straightforward to parallelize, because the IMI structure depends on a complicated multi-sequence algorithm, which is sequential in nature [4] and hard to parallelize.

To the best of our knowledge, there are two high performance systems that are able to handle ANN search for billion-scale datasets on the GPU: PQT [4] and Faiss [6]. PQT proposes a novel quantization method call line quantization (LQ) and is the first billion-scale similarity retrieval system on the GPU. Subsequently Faiss implements the idea of IVFADC on GPUs and currently has the state-of-the-art performance on GPUs. We compare our method against Faiss and two other systems in Section 5.

In this paper, we present VLQ-ADC, a novel billion-scale ANN similarity search framework. VLQ-ADC includes a two-level hierarchical inverted indexing structure based on Vector and Line Quantization (VLQ), which can be implemented on GPU efficiently. The main contributions of our solution are threefold.

We demonstrate how to increase the number of regions with memory efficiency by the novel inverted index structure. The efficient indexing contributes to high accuracy for approximate search. 2. 2.

We describe how to encode data points via a novel algorithm for high runtime/memory efficiency. 3. 3.

Our evaluation shows that our system consistently and significantly outperforms state-of-the-art GPU- and CPU-based retrieval systems on both recall and efficiency on two public billion-scale benchmark datasets with single- and multi-GPU configurations.

The rest of the paper is organized as follows. Section 2 introduces related works on indexing with quantization methods. Section 3 presents VLQ, our approach for approximate nearest neighbor (ANN)-based similarity search method. Section 4 introduces the details of GPU implementation. Section 5 provides a series of experiments, and compares the results to the state of the art.

2 Related work

In this section, we briefly introduce some quantization methods and several retrieval systems related to our approach. Table 1 summarizes the common notations used throughout this paper. For example, we assume that $\mathcal{X}=\{x_{1},\dots,x_{N}\}\subset\mathbb{R}^{D}$ is a finite set of $N$ data points of dimension $D$ .

2.1 Vector quantization (VQ)

In vector quantization [15] (Figure 4 a), a quantizer is a function $q_{v}$ that maps a $D$ -dimensional vector $x$ to a vector $q_{v}(x)\in C$ , where $C$ is a finite subset of $\mathbb{R}^{D}$ , of $k$ vectors. Each vector $c\in C$ is called a centroid, and $C$ is a codebook of size $k$ . We can use Lloyd iterations [19] to efficiently obtain a codebook $C$ on a subset of the dataset. For a finite dataset, $\mathcal{X}$ , $q_{v}(x)$ induces quantization error $E$ :

[TABLE]

According to Lloyd’s first condition, to minimize quantization error a quantizer should map vector $x$ to its nearest codebook centroid.

[TABLE]

Hence, the set of points $\mathcal{X}_{i}=\{x\in\mathbb{R}^{D}\mid q_{v}(x)=c_{i}\}$ is called a cluster or a region for centroid $c_{i}$ .

The inverted index structure based on VQ [11] can split the dataset space into $k$ regions that correspond to the $k$ centroids of the codebook. Since the ratio of regions to centroids is $1\colon\!1$ , it requires a large amount of space to store the $D$ -dimensional centroids when $k$ is large. This would give a negative effect on the performance of the retrieval system. Our hierarchical index structure based on VLQ increase the ratio by $n$ times, i.e., $n$ times more regions can be generated by our indexing structure with the same number of centroids as the VQ based indexing structure.

2.2 Product quantization (PQ)

Product quantization (Figure 4 (b)) is an extension of vector quantization. Assuming that the dimension $D$ is a multiple of $m$ , any vector $x\in\mathbb{R}^{D}$ can be regarded as a concatenation $(x^{1},\cdots,x^{m})$ of $m$ sub-vectors, each of dimension $D/m$ . Suppose that $C^{1},\cdots,C^{m}$ are $m$ codebooks of subspace $\mathbb{R}^{D/m}$ , each owns $k$ $D/m$ -dimensional sub-centroids. A codebook of a product quantizer $q_{p}$ is thus a Cartesian product of sub-codebooks.

[TABLE]

Hence the codebook $C$ contains a total of $k^{m}$ centroids, each is a form of $c=(c^{1},\cdots,c^{m})$ , where each sub-centroid $c^{i}\in C^{i}$ for $i\in\mathcal{M}=\{1,\cdots,m\}$ . A product quantizer $q_{p}$ should minimize the quantization error $E$ defined in Formula 1. Hence, for $x\in\mathbb{R}^{D}$ , the nearest centroid in codebook $C$ is

[TABLE]

where $q^{i}$ is a sub-quantizer of $q$ and $q_{p}^{i}(x)$ is the nearest sub-centroid for sub-vector $x^{i}$ , i.e., the nearest centroid $q_{p}(x)$ for $x$ is the concatenation of the nearest sub-centroids for sub-vector $x^{i}$ .

The inverted multi-index structure (IMI) applies the idea of PQ for indexing and can generate $k^{m}$ regions with $m$ codebooks of $k$ sub-centroids each. The benefit of inverted multi-index is thus it can easily generate a much larger number of regions than that of VQ-based inverted index structure with moderate values of $m$ and $k$ . The drawback of IMI is that it produces a lot of empty regions when the distributions of subspaces are not independent [5]. This will affect the system’s performance when handling datasets which have significant correlations between different subspaces, such as CNN-produced feature point dataset [5].

The PQ-based indexing structure later has been improved by OPQ [20] and LOPQ [12]. OPQ make a rotation on dataset points by a global $D\times D$ rotation matrix and LOPQ rotates the points which belong to the same cell by a same local $D\times D$ rotation matrix to minimize correlations between two subspaces [20]. OPQ and LOPQ can both improve the indexing efficiency of PQ but slow down the query speed by a large margin as well.

Additionally, PQ can also be used to compress datasets. Typically each sub-codebook of PQ contains $256$ sub-centroids and each vector $x$ is mapped to a concatenation of $m$ sub-centroids $(c^{1}_{j_{1}},\cdots,c^{m}_{j_{m}})$ , for $j_{i}$ is a value between $1$ and $256$ . Hence the vector $x$ can be encoded into an $m$ -byte code of sub-centroid index $(j_{1},\cdots,j_{m})$ . With the approximate representation by PQ, the Euclidean distances between the query vector and the large number of compressed vectors can be computed efficiently. According to the ADC procedure [11], the computation is performed based on lookup tables.

[TABLE]

where $y^{i}$ is the $i$ th subvector of a query $y$ . The Euclidean distances between the query sub-vector $y^{i}$ and each sub-centroids $c^{i}_{j_{i}}$ can be precomputed and stored in lookup tables that reduce the complexity of distance computation from $\mathcal{O}(\emph{D})$ to $\mathcal{O}(\emph{m})$ . Due to the high compression quality and efficient distance computation approach, PQ is considered the top choice for compact representation of large-scale datasets[20, 12, 14, 7, 3].

2.3 Line quantization (LQ)

Line quantization (LQ) [4] owns a codebook $C$ of $k$ centroids like VQ. As shown in Figure 4 (c), with any two different centroids $c_{i},c_{j}\in C$ , a line is formed and denoted by $l(c_{i},c_{j})$ . A line quantizer $q_{l}$ quantizes a point $x$ to the nearest line as follows:

[TABLE]

where $d(x,l(c_{i},c_{j}))$ is the Euclidean distance from $x$ to the line $l(c_{i},c_{j})$ , and the set $\mathcal{X}_{i,j}=\{x\in\mathbb{R}^{D}|q_{l}(x)=l(c_{i},c_{j})\}$ is called a cluster or a region for line $l(c_{i},c_{j})$ . The squared distance $d(x,l(c_{i},c_{j}))$ can be calculated as following :

[TABLE]

Because the values of $\parallel x-c_{j}\parallel^{2},\parallel x-c_{i}\parallel^{2},\parallel c_{j}-c_{i}\parallel^{2}$ can be pre-computed between $x$ and all centroids, Equation 7 can be calculated efficiently. The anchor point of $x$ is represented by $(1-\lambda)\cdot c_{i}+\lambda\cdot c_{j}$ , where $\lambda$ is a scalar parameter that can be computed as following:

[TABLE]

When $x$ is quantized to a region of $l(c_{i},c_{j})$ , then the displacement of $x$ from $l(c_{i},c_{j})$ can be computed as following:

[TABLE]

Here we regard $l(c_{i},c_{j})$ and $l(c_{j},c_{i})$ as two different lines. So LQ-based indexing structrue can produce $k\cdot(k-1)$ regions with a codebook of $k$ centroids, The benefit of LQ-based indexing structure is that it can produce many more regions than that of VQ-based regions. However it is considerably more complicated to find the nearest line for a point $x$ when $k$ is large. So we use LQ as an indexing approach with a codebook of a few lines.

2.4 The applications of VQ-based and PQ-based indexing structures for billion-scale dataset

In this subsection we introduce several billion-scale similarity retrieval systems that apply VQ- or PQ-based indexing structure and encoded by PQ, and discuss their strengths and weaknesses.

All the systems discussed below are best-performing, state-of-the-art systems for billion-scale high-dimensional ANN search. Their indexing structure and encoding method are summarized in Table 2. Since all these systems employ the same encoding method based on PQ, we will mainly focus on their indexing structures in the discussions below.

Faiss [6] is a very efficient GPU-based retrieval approach, by realizing the idea of IVFADC [11] on GPUs. Faiss uses the inverted index based on VQ [21] for non-exhaustive search and compresses the dataset by PQ. The inverted index of IVFADC owns a vector quantizer $q$ with a codebook of $k$ centroids. Thus there are $k$ regions for the data space. Each point $x\in\mathcal{X}$ is quantized to a region corresponding to a centroid by a VQ quantizer $q_{v}$ . The displacement of each point from the centroid of a region it belongs to is defined as

[TABLE]

where the displacement $r_{q}(x)$ is encoded by PQ with $m$ codebooks shared by all regions. For each region, an inverted list of data points is maintained, along with PQ-encoded displacements.

The search process of Faiss/IVFADC proceeds as follows:

A query point $y$ is quantized to its $w$ nearest regions, extracting a list of candidates $\mathcal{L}_{c}\subset\mathcal{X}$ which have a high probability of containing the nearest neighbor. 2. 2.

The displacement of the query point $y$ from the centroid of each sub-region is computed as $r_{q}(y)$ . 3. 3.

The distances between $r_{q}(y)$ and PQ-encoded displacements in $\mathcal{L}_{c}$ are then computed according to Formula 5. 4. 4.

Sort the list $\mathcal{L}_{c}$ to be $\mathcal{L}_{s}$ based on the distances computed above. The first points in $\mathcal{L}_{s}$ are returned as the search result for query point $y$ .

Ivf-hnsw [7] is a retrieval system based on a two-level inverted index structure. Ivf-hnsw first splits the data space into $k$ regions like IVFADC. Then each region is further split into several sub-regions that correspond to $n$ sub-centroids. Each sub-centroid of a region can be represented by the centroid of the region and another centroid of a neighbor region. Assume that each region has $n$ neighbor regions, thus each region can be split into $n$ regions. Each data point is first quantized to a region and then further quantized to a sub-region of the region. The displacement of each point from the sub-centroid of a sub-region it belongs to is encoded by PQ. An inverted list of data point is maintained for each sub-regions similar to IVFADC.

The search process of Ivf-hnsw proceeds as follows:

A query point $y$ is quantized to its $w$ first-level nearest regions, giving $w\cdot n$ sub-regions. 2. 2.

Among the $w\cdot n$ sub-regions, $y$ is secondly quantized to $0.5\cdot w\cdot n$ nearest sub-regions, generating a list of candidates $\mathcal{L}_{c}\subset\mathcal{X}$ . 3. 3.

The displacement of the query point $y$ from the sub-centroid of each sub-region is computed as $r_{q}(y)$ . 4. 4.

The distances between $r_{q}(y)$ and PQ-encoded displacements in $\mathcal{L}_{c}$ are then computed according to Formula 5. 5. 5.

The re-ordering process of Ivf-hnsw is similar to IVFADC/Faiss.

Multi-D-ADC [16] is based on the inverted multi-index which is currently the state-of-the-art indexing method for high-dimensional large-scale datasets. An inverted multi-index of Multi-D-ADC usually owns a product quantizer with two sub-quantizers $q^{1},q^{2}$ for subspace $\mathbb{R}^{D/2}$ , each of $k$ sub-centroids. A region in the D-dimensional space is now a Cartesian product of two corresponding subspace regions. So the IMI can produce $k^{2}$ regions. For each point $x=(x^{1},x^{2})\in\mathcal{X}$ , sub-vectors $x^{1},x^{2}\in\mathbb{R}^{D/2}$ are separately quantized to subspace regions of $q^{1}(x^{1}),q^{2}(x^{2})$ respectively, and $x$ is then quantized to the region of $(q^{1}(x^{1}),q^{2}(x^{2}))$ . The displacement of each point $x$ from the centroid $(q^{1}(x^{1}),q^{2}(x^{2}))$ is also encoded by PQ, and an inverted list of points is again maintained for each region.

The search process of Multi-D-ADC proceeds as follows:

For a query point $y=(y^{1},y^{2})$ , The Euclidean distances of each of sub-vectors $y^{1},y^{2}$ to all sub-centroids of $q^{1},q^{2}$ are computed respectively. The distance of $y$ to a region can be computed according to Formula 5 for $m=2$ . 2. 2.

Regions are traversed in ascending order of distance to $y$ by the multi-sequence algorithm [16] to generate a list of candidates $\mathcal{L}_{c}\subset\mathcal{X}$ . 3. 3.

The displacement of the query point $y$ from the centroid $(c^{1},c^{2})$ of each region is computed as $r_{q}(y)$ as well. 4. 4.

The re-ordering process of Multi-D-ADC is similar to IVFADC/Faiss.

The VQ-based indexing structure requires a large full-dimensional codebook to produce regions when $k$ is large. The PQ-based indexing structure are not suitable for all datasets, especially for those produced by convolutional neural networks (CNN) [7]. The novel VQ-based indexing structure proposed by Ivf-hnsw can produce more regions than the prior VQ-based indexing structure. However its performance on the codebook of small size is not good enough. We will discuss that in Sec.5. In comparison, our indexing structure is efficient with a small size of codebook which can accelerate query speed and at the same time is suitable for any dataset irrespective of the presence/absence of correlations between subspaces.

3 The VLQ-ADC System

In this section we introduce our GPU-based similarity retrieval system, VLQ-ADC, that contains a two-layer hierarchical indexing structure based on vector and line quantization and an asymmetric distance computation method. VLQ-ADC incorporates a novel index structure that can index the dataset points efficiently (Sec. 3.1). The indexing and encoding process will be presented in Sec. 3.2, and the querying process is discussed in Sec. 3.3.

Comparing with the existing systems above, One major advantage of our system is that our indexing structure can generate shorter and more accurate candidate list for the query point, which will accelerate query speed by a large margin. Another advantage of our system is that the improved asymmetric distance computation method base on PQ encoding method provide a higher search accuracy. In the remainder of this section we will use Figure 7 to illustrate our framework. We recall that commonly used notations are summarized in Table 1.

3.1 The VLQ-based index structure

For billion-scale datasets with a moderate number of regions (e.g., $2^{16}$ ) produced by vector quantization (VQ), the number of data points in most regions is too large, which negatively affects search accuracy. To alleviate this problem, we propose a hierarchical indexing structure. In our structure, each list is split into several shorter lists, i.e., each region is divided into several subregions, using line quantization (LQ).

Our indexing structure is a two-layer hierarchical structure which consists of two levels of quantizers. The first level contains a vector quantizer $q_{v}$ with a codebook of $k$ centroids. The vector quantizer $q_{v}$ partitions the data point space $\mathcal{X}$ into $k$ regions. The second level contains a line quantizer $q_{l}$ with an $n$ -nearest neighbor ( $n$ -NN) graph. The $n$ -NN graph is a directed graph in which nodes are first-level centroids and edges connect a centroid to its $n$ nearest neighbors. In each first-level region, the line quantizer $q_{l}$ then quantizes each data point to the closest edge in the $n$ -NN graph, thus splitting the region into $n$ second-level regions.

As an example, in the right side of Figure 7, given $n=4$ , the top left first-level region is further divided into 4 subregions by $q_{l}$ , enclosed by solid lines and denoted 1, 2, 3, and 4. Each subregion contains all the data points that are closest to a given edge of the $n$ -NN graph, as calculated by the line quantization $q_{l}$ .

Training the codebook. We use Lloyd iteration in the fashion of the Linde-Buzo-Gray algorithm [15] to obtain the codebook of the VQ quantizer $q_{v}$ . The $n$ -NN graph is then built on the centroids of the codebook.

Memory overhead of indexing structure. One advantage of our indexing structure is its ability to produce substantially more subregions with little additional memory consumption. Same as VQ, our first layer codebook needs $k\cdot D\cdot sizeof(\texttt{float})$ bytes. In addition, for second-level indexing, for each of the $k$ first-layer centroids, the $n$ -NN graph only needs to store (1) the indices of its $n$ nearest neighbors and (2) the distances to its $n$ nearest neighbors, which amounts to $k\cdot n\cdot(sizeof(\texttt{int})+sizeof(\texttt{float}))$ bytes. Note we do not need to store the full-dimensional points. For a typical values of $k=2^{16}$ centroids and $n=32$ subcentroids, the additional memory overhead for storing the graph is $2^{16}\cdot 32\cdot(32+32)$ bits (16 MB), which is acceptable for billion-scale datasets.

One way to produce the subregions is by utilizing vector quantization (VQ) again in each region. However, that would require storing full-dimensional subcentroids and thus consume too much additional memory. For the same configuration ( $k=2^{16}$ centroids and $n=32$ subcentroids) and a dimension of $D=128$ , the additional memory overhead for a VQ-based hierarchical indexing structure would be $2^{16}\cdot 32\cdot 128\cdot sizeof(\texttt{float})$ additional bits (1,024 MB). As can be seen, our VLQ-based hierarchical indexing structure is substantially more compact, only consuming $1/64$ of the memory required by a VQ-based approach for the second-level codebook.

We note that the PQ-based indexing structure requires $\mathcal{O}(k\cdot(D+k))$ memory to maintain the indexing structure (Table 2), which is memory inefficient as it is quadratic in $k$ . This is a limitation of PQ-based indexing structure. In contrast, the space complexity of our hierarchical indexing structure is $\mathcal{O}(k\cdot(D+n))$ , where typically $n\ll k$ ( $n$ is much smaller than $k$ ), hence making our index much more memory efficient.

3.2 Indexing and encoding

In this subsection, we will describe the indexing and encoding process and summarize both processes in Algorithm 1 and 2 respectively.

For our two-level index structure, the indexing process comprises two different quantization procedures, one for each layer. Similar to the IVFADC scheme, each dataset point is quantized by the vector quantizer $q_{v}$ to the first-level regions surrounded by the dotted lines in Figure 7. These regions form a set of inverted lists as search candidates.

We describe the second-level indexing process as follows. Let $\mathcal{X}^{i}$ be a region of $\{x_{1},\ldots,x_{l}\}$ that corresponds to a centroid $c_{i}$ , for $i\in\{1,\ldots,k\}$ . In constructing the $n$ -NN graph, let $S_{i}=\{s_{i1},\ldots,s_{in}\}$ denote the set of the $n$ centroids closest to $c_{i}$ and $l(c_{i},s_{ij})$ denote an edge between $c_{i}$ and $s_{ij}$ , for $j\in\{1,\ldots,n\}$ . The points in $\mathcal{X}^{i}$ are quantized to the subregions by a line quantizer $q_{l}$ with a codebook $\mathcal{E}_{i}$ of $n$ edges $\{l(c_{i},s_{i1}),\ldots,l(c_{i},s_{in})\}$ . Thus the region $\mathcal{X}^{i}$ is split into $n$ subregions $\{\mathcal{X}^{i}_{1},\ldots,\mathcal{X}^{i}_{n}\}$ and each point $x\in\mathcal{X}^{i}$ is quantized to a second-level subregion $\mathcal{X}^{i}_{j}$ . So the entire space $\mathcal{X}$ are divided into $k\times n$ second-level subregions.

[TABLE]

Each data point in the dataset $\mathcal{X}$ is assigned to one of the $k\cdot n$ cells. When the data point $x$ is quantized to the sub-region of edge $l(c_{i},s_{ij})$ , according to the Equation 9 and 8 the displacement of $x$ from the corresponding anchor point can be computed as following:

[TABLE]

.

As shown in Algorithm 2, the value of $r_{q_{l}}(x)$ is first computed by Equation 12 and encoded into $m$ bytes using PQ [11]. The PQ codebooks are denoted by $C^{1},\ldots,C^{m}$ , each containing 256 sub-centroids. The vector $r_{q_{l}}(x)$ is mapped to a concatenation of $m$ sub-centroids $(c^{1}_{j_{1}},\cdots,c^{m}_{j_{m}})$ , for $j_{i}$ is a value between $1$ and $256$ . Hence the vector $r_{q_{l}}(x)$ is encoded into an $m$ -byte code of sub-centroid index $(j_{1},\cdots,j_{m})$ . In Figure 4(c), we assume that $c_{i}$ is the closest centroid to $x$ and can observe that the anchor point of each point $x$ is closer to $x$ than $c_{i}$ . So the dataset points can be encoded more accurately with the same code length. This will improve the recall rate of search, as can be seen in our evaluation in Section 5.

From Equation 13, the value of $\lambda_{ij}$ for each point can be computed. it is a float type value and requires 4 bytes for each data point. To further improve memory efficiency, we quantize it into 256 values and encode it by a byte. Empirically we find that the encoded $\lambda_{ij}$ still exhibits high recall rates.

3.3 Query

One important advantage of our indexing structure is that at query time, a specific query point only needs to traverse a small number of cells whose edges are closest to the query point, as shown in the right side of Figure 7. There are three steps for query processing: (1) region traversal, (2) distance computation and (3) re-ranking.

3.3.1 Region traversal

The region traversal process consists of two steps: first-level regions traversal and second-level regions traversal. During first-level regions traversal, a query point $y$ is quantized to its $w_{1}$ nearest first-level regions, which correspond to $w_{1}\cdot n$ second-level regions produced by quantizer $q_{v}$ . The subregions traversal is performed within only the $w_{1}\cdot n$ second-level regions. Moreover, $y$ is quantized again to $w_{2}$ nearest second-level regions by quantizer $q_{l}$ . Then the candidate list of $y$ is formed by the data points only within the $w_{2}$ nearest second-level regions. Because the $w_{2}$ second-level regions is obviously smaller than the $w_{1}$ first-level regions, the candidate list produced by our VLQ-based indexing structure is shorter than that produced by the VQ-based indexing structure. This will result in a faster query speed.

We use parameter $\alpha$ to determine the percentage of $w_{1}\cdot n$ second-level regions to be traversed give a query, such that $w_{2}=\alpha\cdot w_{1}\cdot n$ . We conduct a series of experiments in Section 5 to discuss the performance of our system with different values of $\alpha$ .

3.3.2 Distance computation

Distance computation is a prerequisite condition for re-ranking. In this section, we describe how to compute the approximate distance between a query point $y$ to a candidate point $x$ . According to [11], the distance from $y$ to $x$ can be evaluated by asymmetric distance computation (ADC) as follows:

[TABLE]

where $q_{1}(x)=(1-\lambda_{ij})\cdot c_{i}+\lambda_{ij}\cdot{s_{ij}}$ and $q_{2}(\cdots)$ is the PQ approximation of the $x_{i}$ displacement.

Expression 14 can be further decomposed as follows [3]:

[TABLE]

where $\langle\cdot,\cdot\rangle$ denotes the inner product between two points.

If $l(c_{i},s_{ij})$ is the closest edge to $x$ , i.e., $q_{1}(x)=(1-\lambda_{ij})c_{i}+\lambda_{ij}s_{ij}$ , Expression 15 can be transformed in the following way:

[TABLE]

According to Equation 7, term1 in Expression 16 can be computed in the following way:

[TABLE]

In Expression 16 and Equation 17, some computations can be done in advance and stored in lookup table as follows:

All of term2, term3, term4 and term7 are independent of the query. They can be precomputed from the codebooks. Term2 is the squared norm of the displacement approximation and can be stored in a table of size $256\times m$ . Term7 is the square of the length of the edge that the point $x$ belongs to and is already computed in the codebook learning process. Term3 and term4 are scalar products of the PQ sub-centroids and the corresponding first-level centroid subvectors and can be stored in a table of size $k\times 256\times m$ .

2.

Term6 and term8 are the distances from the query point to the first-layer centroids. They are the by-product of first-layer traversal.

3.

Term5 is the scalar product of the PQ sub-centroids and the corresponding query subvectors and can be computed independently before the search. Its computation costs $256\times D$ multiply-adds [6].

The proposed decomposition is used to simplify the distance computation. With the lookup tables, the distance computation only requires $256\times D$ multiply-adds and $2\times m$ lookup-adds. In comparison, the classic IVFADC distance computation requires $256\times D$ multiply-adds and $m$ lookup-adds [6]. The additional $m$ lookup-adds in our framework improves the distance computation accuracy with a moderate increase of time overhead. We will discuss this trade-off in detail in Section 5.

3.3.3 Re-ranking

Re-ranking is a step of re-sorting the candidate list of data points according to the distances from candidate points to the query point. It is the last step of the query process. The purpose of re-ranking is to find out the nearest neighbours to the query point among the candidate points by distance comparing.We apply the fast sorting algorithm of [6] to our re-ranking step. Due to the shorter candidate list and more accurate approximate distances, the re-ranking step of our system is both faster and more accurate than that of Faiss.

4 GPU Implementation

One advantage of our VLQ-ADC framework is that it is amenable to implementations on GPUs. It is mainly because our searching and distance computing algorithm that applied during query can be efficiently parallelized on GPUs. In this work we have implemented our framework in CUDA.

There are three different levels of granularity of parallelism on GPU: threads, blocks and grids. A block is composed of multiple threads, and a grid is composed of multiple blocks. Furthermore, there are three memory types on GPU. Global memory is typically $4$ – $32$ GB in size with $5$ – $10\times$ higher bandwidth than CPU main memory [6], and can be shared by different blocks. Shared memory is similar to CPU L1 cache in terms of speed and is only shared by threads within the same block. GPU register file memory has the highest bandwidth and the size of register file memory on GPU is much larger than that on CPU [6].

VLQ-ADC is able to utilize GPU efficiently for indexing and search. For example, we use blocks to process $D$ -dimensional query points and the threads of a block to traverse the inverted lists. We use global memory to store the indexing structures and compressed dataset that is shared by all blocks and grids, and load part of lookup tables in the shared memory to accelerate distance computation. As the GPU register file memory is very large, we store structured data in the register file memory to increase the performance of the sorting algorithm.

Algorithm 3 summarizes our search process that is implemented on GPU. We use four arrays to store the information of the inverted index lists. The first array stores the length of each index list, the second one stores the sorted vector IDs of each list, and the third the fourth store the corresponding codes and $\lambda$ values of each list respectively. For an NVIDIA GTX Titan X GPU with a 12GB of RAM, we load part of the dataset indexing structure in the global memory for different kernels, i.e., region size, data points compressed codes and $\lambda$ values of each list. A kernel is the unit of work (instruction stream with arguments) scheduled by the host CPU and executed by GPUs [6]. We load the vector IDs on the CPU side, because vector IDs are resolved only if re-ranking step determines K-nearest membership. This lookup produces a few sparse memory reads in a large array, thus the IDs stored on CPU can only cause a tiny performance cost.

Our implementation makes use of some basic functions from the Faiss library, including matrix multiplication and the K-selection algorithm to improve the performance of our approach444The source code will be released upon publication..

K-selection

The K-selection algorithm is a high-performance GPU-based sorting method proposed by Faiss [6] and GSKNN [8]. The K-selection keep intermediate data in the register file memory. It exchanges register data using the warp shuffle instruction, enabling warp-wide parallelism and storage. The warp is a 32-wide vector of GPU threads, each thread in the warp has up to 255 32-bit registers in a shared register file. All the threads in the same warp can exchange register data using the warp shuffle instruction.

List search

We use two kernels for inverted list search. The first kernel is responsible for quantizing each query point to $w_{1}$ nearest first-level regions (line 3 in Algorithm 3). The second kernel is responsible for finding out the $w_{2}$ nearest second-level regions for the query point (line 4 in Algorithm 3). The distances between each query point and its $w$ nearest centroids are stored for further calculation. In the two kernels, we use a block of threads to process one query point, thus a batch of $n_{q}$ query points can be processed concurrently.

Distance computation and re-ranking

After the inverted lists $\mathcal{L}_{\textit{i}}$ of each query point are collected, there are up to $n_{q}\times w_{2}\times\max|\mathcal{L}_{\textit{i}}|$ candidate points to process. During the distance computation and re-ranking process, processing all the query points in a batch yields high parallelism, but can exceed available GPU global memory. Hence, we choose a tile size $t_{q}<n_{q}$ based on amount of available memory to reduce memory overhead, bounding its complexity by $\mathcal{O}(t_{q}\times w_{2}\times max|\mathcal{L}_{i}|)$ .

We use one kernel to compute the distances from each query point to the candidate points according to Expression 16, and sort the distances via the K-selection algorithm in a separate kernel. The lookup tables are stored in the global memory. In the distance computation kernel, we use a block to scan all $w_{1}$ inverted lists for a single query point, and the significant portion of the runtime is the $2\times w_{2}\times m$ lookups in the lookup tables and the linear scanning of the $\mathcal{L}_{i}$ from global memory.

In the re-ranking kernel, we refer to Faiss by using a two-pass K-selection. First reduce $t_{q}\times w_{2}\times max|\mathcal{L}_{\textit{i}}|)$ to $t_{q}\times\tau\times K$ partial results, where $\tau$ is some subdivision factor, then the partial results are reduced again via k-selection to the final $t_{q}\times K$ results.

Due to the limited amount of GPU’s memory, if an index instance with long encoding length cannot fit in the memory of a single GPU, it cannot be processed one the GPU efficiently. Our framework supports multi-GPU parallelism to process indexing instance of a long encoding length. For $b$ GPUs, we split the index instance into $b$ parts, each of which can fit in the memory of a single GPU. We then process the local search of $n_{q}$ queries on each GPU, and finally join the partial results on one GPU. Our multi-GPU system is based on MPI, which can be easily extended to multiple GPUs on multiple servers.

5 Experiments and Evaluation

In this section, we evaluate the performance of our system VLQ-ADC and compare it to three state-of-the-art billion-scale retrieval systems that are based on different indexing structures and implemented on CPUs or GPUs: Faiss [6], Ivf-hnsw [7] and Multi-D-ADC [16]. All the systems are evaluated on the standard metrics: accuracy and query time, with different code lengths. All the experiments are conducted on a machine with two 2.1GHz Intel Xeon E5-2620 v4 CPUs and two NVIDIA GTX Titan X GPUs with 12 GB memory each.

The evaluation is performed on two public benchmark datasets that are commonly used to evaluate billion-scale ANN search: SIFT1B [22] of $10^{9}$ 128-D vectors and DEEP1B [5] of $10^{9}$ 96-D vectors. Each dataset has a 10,000 query set with the precomputed ground-truth nearest neighbors. For our system, we sample $2\times 10^{6}$ vectors from each dataset for learning all the trainable parameters. We evaluate the search accuracy by the test result Recall@ $K$ , which is the rate of queries for which the nearest neighbors is in the top $K$ results.

Here we choose nprobe =64 for all the inverted indexing systems (Faiss, Ivf-hnsw and VLQ-ADC), as 64 is a typical value for nprobe in the Faiss system. The parameter max_codes that means the max number of candidate data points for a query is only useful to CPU-based system (max_codes is set to 100,000), hence for GPU-based systems like Faiss and VLQ-ADC, max_codes parameter is not configured. In fact, we compute the distances of query point to all the data points that are contained in the neighbor regions.

5.1 Evaluation without re-ranking

In experiment 1, we evaluate the index quality of each retrieval system. We compare three different inverted index structures and two inverted multi-index schemes with different codebooks sizes without the re-ranking step.

Faiss. We build a codebook of $k=2^{18}$ centroids by k-means, and find proposed inverted lists of each query by Faiss. 2. 2.

Ivf-hnsw. We use a codebook of $k=2^{16}$ centroids by k-means, and set 64 sub-centroids for each first-level centroid.555We use the implementation of Ivf-hnsw that is available online

(https://github.com/dbaranchuk/ivf-hnsw) for all the experiments. 3. 3.

Multi-D-ADC. We use two IMI schemes with two codebook sizes $k=2^{10}$ and $k=2^{12}$ and choose the implementation from the Faiss library for all the experiments. 4. 4.

VLQ-ADC. For our approach, we use the same codebook as Ivf-hnsw, and a 64-edge k-NN graph with indexing and querying as described in Section 3.2 and 3.3.666The VLQ-ADC source code is available at https://github.com/zjuchenwei/vector-line-quantization.

The recall curves of each indexing approach are presented in Figure 10. On both datasets, our proposed system VLQ-ADC (blue curve) outperforms the other two inverted index systems and the Multi-D-ADC scheme with small codebooks ( $k=2^{10}$ ) for all the reasonable range of $X$ . Compared with the Multi-D-ADC scheme with a larger codebook ( $k=2^{12}$ ), our system performs better on DEEP1B, and almost equally well on SIFT1B.

On the DEEP1B dataset, the recall rate of our system is consistently higher than that of all the other indexing structures. With a codebook that is only 1/4 the size of Faiss’ codebook, the recall rate of our inverted index is higher than Faiss. This demonstrates that the line quantization procedure can further improve the index quality than the previous inverted index methods.

Even on the SIFT1B dataset, the performance of our indexing structure is almost the same as that of IMI with much larger codebook $k=10^{12}$ and much better than other inverted index structures.

As shown in Figure 10, for the SIFT1B dataset, the IMI with $k=2^{12}$ can generate better candidate list than the inverted indexing structures. While for the DEEP1B dataset, the performance of the IMI falls behind that of the inverted indexing structures. The reasons are that SIFT vectors are histogram-based and the subvectors are corresponding to the different subspaces, which describe disjoint image parts that have weak correlations in the subspace distributions. On the other hand, the DEEP vectors are produced by CNN that have a lot of correlations between the subspaces. It can be observed that the performances of our indexing structure is consistent across the two datasets. This demonstrates that our indexing structure’s suitability for different data distributions.

5.2 Evaluation with re-ranking

In experiment 2, we evaluate the recall rates with the re-ranking step. In all systems the dataset points are encoded in the same way: indexing and encoding. (1) Indexing: displacements from data points to the nearest cell centroids are calculated. For VLQ-ADC the displacements are calculated from data points to the nearest anchor points on the line. (2) Encoding: the residual values are encoded into 8 or 16 bytes by PQ with the same codebooks shared by all the cells. Here we compare the same four retrieval systems as in experiments 1. All the configurations for the retrieval systems are the same as in experiment 1. For the GPU-based systems, we evaluate performance with 8-byte codes on 1 GPU and 16-byte codes on 2 GPUs.

The Recall@ $K$ values for different values $K=1/10/100$ and the average query times on both datasets in milliseconds (ms) are presented in Table 3. From Table 3 we can make the following important observations.

Overall best recall performance.

Our system VLQ-ADC achieves best recall performance for both datasets and the two codebooks (8-byte and 16-byte) in most cases. For the twelve recall values (Recall@1/10/100 $\times$ two codebooks $\times$ two datasets), VLQ-ADC achieves best values in nine cases and second best in two cases. The second-best system is Faiss, obtaining best results in two cases. Multi-D-ADC (with $k=2^{12}\times 2^{12}$ regions) obtains best results in one case.

Substantial speedup.

VLQ-ADC is consistently and significantly faster than all the other systems in all experiments. For all configurations, VLQ-ADC’s query time is within 0.054–0.068 milliseconds, while the other systems’ query times vary greatly. In the most extreme case, VLQ-ADC is $125\times$ faster than Multi-D-ADC (0.068 vs 8.54). At the same time, VLQ-ADC is also consistently faster than the second fastest system, the GPU-based Faiss, by an average $5\times$ speedup.

Comparison with Faiss.

VLQ-ADC outperforms the current state-of-the-art GPU-based system Faiss in terms of both accuracy and query time by a large margin, except for only three out of sixteen cases (R@10 with 16-byte codes for SIFT1B, and R@100 with 16-byte codes for SIFT1B and DEEP1B). E.g., as a GPU-based system, VLQ-ADC outperforms Faiss in terms of accuracy by 17%, 14%, 4% of R@1, R@10 and R@100 respectively on the SIFT1B dataset and 8-byte codes. At the same time, the query time is consistently and significantly faster than Faiss, with a speedup of up to $5.7\times$ . Faiss outperforms VLQ-ADC in recall values in three cases, all with 16-byte codes. However, the difference is negligible ( $\sim$ 0.02%). Similarly, though less pronounced, characteristics can be observed on DEEP1B.

The main reason for this improvement is that the index quality and encoding precision in VLQ-ADC is better than those of Faiss. Due to the better indexing quality, the inverted list of our system is much shorter than that of Faiss, which results in a much shorter query time.

Additionally, although the codebook size of our system ( $k=2^{16}$ ) is only $1/4$ of that of Faiss ( $k=2^{18}$ ), our system produces more regions ( $2^{22}$ ) than Faiss ( $2^{18}$ ). Therefore, our system achieves better accuracy as well as memory and runtime efficiency than Faiss.

Comparison with Multi-D-ADC.

The proposed system also outperforms the IMI based system Multi-D-ADC both in terms of accuracy and query time on both datasets. For example, VLQ-ADC leads Multi-D-ADC with codebooks $k=2^{12}$ by 14.2%, 7.4%, 1.3% of R@1, R@10 and R@100 respectively on the SIFT1B dataset and 8-byte codes with up to $6.8\times$ speedup. On the DEEP1B dataset, the advantage of our system is even more pronounced. Similarly, VLQ-ADC outperforms Multi-D-ADC scheme with smaller codebooks $k=2^{10}$ even more significantly, especially in terms of query time, where VLQ-ADC consistently achieves speedups of at least one order of magnitude while obtaining better recall values.

Comparison with Ivf-hnsw.

Similarly, VLQ-ADC outperforms Ivf-hnsw, another CPU-based retrieval system in both recall and query time. Although Ivf-hnsw can also produce more regions with a small codebook, it still cannot outperform the VQ-based indexing structure with larger size of codebook.

Effects on recall of indexing and encoding.

The improvement of R@10 and R@100 shows that the second-level line quantization provides more accurate short-list of candidates than the previous inverted index structure, and the improvement of R@1 shows that it can also improves encoding accuracy.

Multi-D-ADC.

From Table 3, we can also observe that Multi-D-ADC scheme with $k=2^{12}$ outperforms the scheme with $k=2^{10}$ in query time by a large margin. It is mainly because Multi-D-ADC with larger codebooks can produce more regions, which can extract more concise and accurate short-lists of candidates.

5.3 Data point distributions of different indexing structures

The space and time efficiency of an indexing structure is impacted by the distribution of data points produced by the structure. To analyse the distribution produced by the structures studied in this paper, we plot in Figure 13 the percentages of regions by the discretized number of data points in each region.

As shown in Figure 13, the portion of empty regions produced by the inverted indexing structures (Faiss, Ivf-hnsw and VLQ-ADC) is much less than that produced by the inverted multi-index structure (Multi-D-ADC). For Multi-D-ADC, there are 38% empty regions for SIFT1B and 58% empty regions for DEEP1B (left most group in each plot). This result empirically validates the space inefficiency of inverted multi-index structure [16].

For Faiss, which is based on the inverted indexing structure using VQ, over 98% and 93% of regions contain more than 500 data points for SIFT1B and DEEP1B respectively. This will possibly produce long candidate lists for queries, thus negatively impacting query speed. For VLQ-ADC (and Ivf-hnsw), the regions are much more evenly distributed. The majority of the regions on both datasets contain less than 500 data points, and more regions contain 101–300 data points than others. This is a main reason why VLQ-ADC can provide shorter candidate lists and thus a faster query speed.

5.4 Performance comparison under different dataset scales

In this section we evaluate the performance of our system under different dataset scales. Figure 16 shows, for SIFT1B and DEEP1B, the recall and query time values for Faiss and VLQ-ADC for subsets of SIFT and DEEP1B of different sizes: 1M, 10M, 100M, 300M and 1000M (full dataset) respectively. As can be seen in the figure, the recall values of VLQ-ADC is always higher than that of Faiss under all dataset scales. When the scale of dataset is under 300M, the query speed of Faiss is slightly faster than that of VLQ-ADC. When the scale of the datasets is over 300M, the query speed of VLQ-ADC matches that of Faiss.

It can also been observed from the figure that for the full datasets of SIFT1B and DEEP1B (1000M), Faiss takes 0.31ms and 0.32ms respectively (see Table 3 too). Compared to the 100M subsets of these two datasets, Faiss suffers an approx. $15\times$ slowdown when data scale grows $10\times$ . On the other hand, for these two datasets, VLQ-ADC takes 0.054ms and 0.059ms respectively, representing only an approx. $2\times$ slowdown when data scale grows $10\times$ . The superior scalability and robustness of VLQ-ADC over Faiss is evident from this experiment.

5.5 Evaluation on impact of parameter values

Number of centroids $k$ and edges $n$ . We evaluate the performance of VLQ-ADC on different $k$ and $n$ values with 8-byte codes. We first fix the value of $n$ to 64 and compare the performance of our system for different $k$ centroids. In Figure 20, we present the evaluation of VLQ-ADC for $k=2^{16}/2^{17}/2^{18}$ . Then we fix $k=2^{16}$ and increase the number of edge $n$ from 32 to 64 and 128. In Figure 24, we present the evaluation of the VLQ-ADC for different edge numbers.

From Figure 20 and 24 we can observe that the increase in the number of centroids and edges can improve search accuracy, while slightly increasing query time. This is because the indexing scheme with more centroids and more edges can represent the dataset points more accurately and hence provide more accurate short inverted lists.

Value of portion $\alpha$ . Now, we discuss how to determine the value of parameter $\alpha$ for subregions pruning, as described in Section 3.3.1. As shown in Figure 28, we test several values of $\alpha$ on both datasets. A lower $\alpha$ value means fewer subregions will be traversed, hence lower query time. At the same time, we can observe that higher $\alpha$ values only moderately increase recall values, while significantly increases query time (up to $3.7\times$ times). Hence we choose $\alpha=0.25$ .

Time and memory consumption. Because the billion-scale dataset do not fit on the GPU, the database is built in batches of 2M vectors, then aggregating the information on the CPU. With file I/O, it takes about 150 minutes to build the whole database on a single GPU.

Here we analyze the memory consumption of each system. As shown in Table 2, for a database of $N=10^{9}$ points, the basic memory consumption for all systems is $4\cdot N$ bytes for point IDs that are Integer type and $m\cdot N$ bytes for point codes. In addition to that, Multi-D-ADC consumes $4\cdot k^{2}$ bytes to store the region boundaries. Faiss consumes $4\cdot k\cdot D$ bytes for the codebooks and $4\cdot k\cdot m\cdot 256$ bytes for the lookup tables. Ivf-hnsw requires $N$ bytes for quantized norm items $4\cdot k\cdot(D+n)$ bytes for its indexing structure[7]. For our system, we require $N$ bytes for quantized $\lambda$ values and $4\cdot k\cdot(D+2n+m\cdot 256)$ bytes for the codebook, the $n$ -NN graph and the lookup tables. We summarize the total memory consumption for all systems in Table 4 with 8-byte encoding length on both datasets.

As presented in Table 4, the memory consumption of our system is less than that of Faiss, and about 10% more than that of Multi-D-ADC with $2^{12}$ codebook, which is acceptable for most realistic setups.

6 Conclusion

Billion-scale approximate nearest neighbor (ANN) search has become an important task as massive amounts of visual data becomes available online. In this work, we proposed VLQ-ADC, a simple yet scalable indexing structure and a retrieval system that is capable of handling billion-scale datasets. VLQ-ADC combines line quantization with vector quantization to create a hierarchical indexing structure. Search space is further pruned to a portion of the closest regions, further improving ANN search performance. The proposed indexing structure can partition the billion-scale database in large number of regions with a moderate size of codebook, which solved the drawback of prior VQ-based indexing structures.

We performed comprehensive evaluation on two billion-scale benchmark datasets: SIFT1B and DEEP1B and three state-of-the-art ANN search systems: Multi-D-ADC, Ivf-hnsw, and Faiss. Our evaluation shows that VLQ-ADC consistently outperforms all three systems on both recall and query time. VLQ-ADC achieves a recall improvement over Faiss, the state-of-the-art GPU-based system, of up to 17% and a query time speedup of up to $5\times$ times.

Moreover, VLQ-ADC takes the data distribution into account in the indexing structure. As a result, it performs well on datasets with different distributions. Our evaluation shows that VLQ-ADC is the best performer on both SIFT1B and DEEP1B, demonstrating its robustness with respect to data with different distributions.

We conclude by pointing out a number of future work directions. We plan to investigate further improvements to the indexing structure. Moreover, a more systematic and principled method for hyperparameter selection is worthy investigation.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under Grant No.61672246, No.61272068, No.61672254 and the Fundamental Research Funds for the Central Universities under Grant HUST:2016YXMS018. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPUs used for this research. The authors appreciate the valuable suggestions from the anonymous reviewers and the Editors.

Bibliography22

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Chan et al. [2018] T. H. H. Chan, A. Guerqin, M. Sozio, Fully dynamic k-center clustering, in: World Wide Web Conference, 2018, pp. 579–587.
2Weber et al. [1998] R. Weber, H.-J. Schek, S. Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, in: VLDB, volume 98, 1998, pp. 194–205.
3Babenko and Lempitsky [2014] A. Babenko, V. Lempitsky, Improving bilayer product quantization for billion-scale approximate nearest neighbors in high dimensions, ar Xiv preprint ar Xiv:1404.1831 (2014).
4Wieschollek et al. [2016] P. Wieschollek, O. Wang, A. Sorkine-Hornung, H. Lensch, Efficient large-scale approximate nearest neighbor search on the GPU, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2027–2035.
5Yandex and Lempitsky [2016] A. B. Yandex, V. Lempitsky, Efficient indexing of billion-scale datasets of deep descriptors, in: Computer Vision and Pattern Recognition, 2016, pp. 2055–2063.
6Johnson et al. [2017] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GP Us, ar Xiv preprint ar Xiv:1702.08734 (2017).
7Baranchuk et al. [2018] D. Baranchuk, A. Babenko, Y. Malkov, Revisiting the inverted indices for billion-scale approximate nearest neighbors, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 202–216.
8Yu et al. [2015] C. D. Yu, J. Huang, W. Austin, B. Xiao, G. Biros, Performance optimization for the k-nearest neighbors kernel on x 86 architectures, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2015, p. 7.