Accelerated Nearest Neighbor Search with Quick ADC

Fabien André (Technicolor), Anne-Marie Kermarrec (Inria), and Nicolas Le Scouarnec (Technicolor)

arXiv:1704.07355·cs.CV·April 25, 2017

TL;DR

This paper introduces Quick ADC, a SIMD-optimized distance computation procedure for nearest neighbor search with product quantization. It achieves a 3 to 6 times speedup over conventional ADC for a small or negligible loss of recall on large datasets.

Contribution

Quick ADC combines 4-bit sub-quantizers with the quantization of floating-point distances to 8-bit integers, so that ADC lookup tables fit in SIMD registers and lookups can be performed with in-register shuffles.

Findings

01. Achieves a 3 to 6 times speedup over conventional ADC

02. Attains a Recall@100 of 0.94 in 3.4 ms on 1 billion SIFT descriptors (128-bit codes)

03. Offers a better speed-accuracy tradeoff than the state-of-the-art OMulti-D-OADC system on SIFT1B

Tables

Table 1. Speed-accuracy tradeoff for 64-bit codes (SIFT1M, exhaustive search)

$m \times b$	Tables size	Cache	R@100	Tables time	Scan time
$16 \times 4$	1 KiB	L1	83.1%	0.001 ms	6.1 ms
$8 \times 8$	8 KiB	L1	91.6%	0.005 ms	2.7 ms
$4 \times 16$	1 MiB	L3	96.5%	0.77 ms	7.8 ms

Table 2. Systems

System	CPU	RAM
workstation	Xeon E5-1650v3	16 GB DDR4, 2133 MHz
server	Xeon E5-2630v3	128 GB DDR4, 1866 MHz

Table 3. Datasets (values in parentheses give the subset used in the experiments)

Dataset	Base set	Learning set	Query set	Dim.
SIFT1M	1M	100K	10K (1K)	128
SIFT1B	1000M	100M (2M)	10K (1K)	128
GIST1M	1M	500K	1K	960
Deep1M	1M	300K	1K	256

Table 4. Non-exhaustive search, SIFT1M, 64-bit codes (times in ms)

SIFT1M, IVF, $K=256$, $\mathit{ma}=24$
Quantizer	Method	R@100	Index	Tables	Scan	Total
PQ	ADC	0.949	0.008	0.18	0.3	0.48
PQ	QADC	0.907	0.008	0.055	0.072	0.14
PQ	Change	-4.4%		-69%	-76%	-72%
OPQ	ADC	0.963	0.008	0.21	0.29	0.52
OPQ	QADC	0.949	0.008	0.089	0.073	0.17
OPQ	Change	-1.5%		-59%	-75%	-67%

Equations

$$\operatorname{q}(x)=\operatorname*{arg\,min}_{c_{i}\in\mathcal{C}}\lVert x-c_{i}\rVert.$$

$$\operatorname{enc}(x)=i,\text{ such that }\operatorname{q}(x)=c_{i}$$

$$\operatorname{pq}(x)=(c_{i_{0}}^{0},\dots,c_{i_{m-1}}^{m-1})$$

$$\mathcal{C}=\mathcal{C}^{0}\times\dots\times\mathcal{C}^{m-1}$$

$$\operatorname{enc}(x)=(i_{0},\dots,i_{m-1}),\text{ such that }\operatorname{pq}(x)=(c_{i_{0}}^{0},\dots,c_{i_{m-1}}^{m-1})$$

$$\operatorname{opq}(x)=\operatorname{pq}(Rx),\text{ such that }R^{T}R=I,$$

$$\operatorname{r}(x)=x-\operatorname{q_{i}}(x)$$

$$D^{j}=\left(\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[0]\rVert^{2},\dots,\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[k-1]\rVert^{2}\right)$$

$$\operatorname{adc}(y,c)=\sum_{j=0}^{m-1}D^{j}[c[j]]$$

$$\operatorname{adc}(y,c)=\sum_{j=0}^{m-1}\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[c[j]]\rVert^{2}$$

Code & Models

Repositories

technicolor-research/quick-adc (official)

Full text

Abstract.

Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low response times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases. This allows fast answers to NN queries, without accessing the SSD or HDD. The key feature of PQ is that it can compute distances between short codes and high-dimensional vectors using cache-resident lookup tables. The efficiency of this technique, named Asymmetric Distance Computation (ADC), remains limited because it performs many cache accesses.

In this paper, we introduce Quick ADC, a novel technique that achieves a 3 to 6 times speedup over ADC by exploiting Single Instruction Multiple Data (SIMD) units available in current CPUs. Efficiently exploiting SIMD requires algorithmic changes to the ADC procedure. Namely, Quick ADC relies on two key modifications of ADC: (i) the use of 4-bit sub-quantizers instead of the standard 8-bit sub-quantizers and (ii) the quantization of floating-point distances. This allows Quick ADC to exceed the performance of state-of-the-art systems, e.g., it achieves a Recall@100 of 0.94 in 3.4 ms on 1 billion SIFT descriptors (128-bit codes).

Keywords: Large-Scale Multimedia Search; Multimedia Search Acceleration; Product Quantization; SIMD

ICMR ’17, June 06-09, 2017, Bucharest, Romania. DOI: http://dx.doi.org/10.1145/3078971.3078992. ISBN: 978-1-4503-4701-3/17/06.

1. Introduction

The Nearest Neighbor (NN) search problem consists of finding the closest vector $x$ to a query vector $y$ in a database of $N$ $d$-dimensional vectors. Efficient NN search in high-dimensional spaces is a requirement in many multimedia retrieval applications, such as image similarity search, image classification, or object recognition. These problems typically involve extracting high-dimensional feature vectors, or descriptors, and finding the NN of the extracted descriptors among a database of descriptors. For images, SIFT (Lowe, 1999) and GIST descriptors (Oliva and Torralba, 2001) are commonly used.

Although efficient NN search solutions have been proposed for low-dimensional spaces, exact NN search remains challenging in high-dimensional spaces due to the notorious curse of dimensionality. As a consequence, much research work has been devoted to Approximate Nearest Neighbor (ANN) search. ANN search returns sufficiently close neighbors instead of the exact NN. Product Quantization (PQ) (Jégou et al., 2011) is a widely used (Krapac et al., 2014; Xie et al., 2015) ANN search approach. PQ compresses high-dimensional vectors into short codes of a few bytes, enabling in-RAM storage of large databases. This allows fast answers to ANN queries, without SSD or HDD accesses.

The key feature of PQ is that it allows computing distances between uncompressed query vectors and compressed database vectors. This technique, known as Asymmetric Distance Computation (ADC), relies on cache-resident lookup tables. Although ADC is faster than distance computations in high-dimensional spaces, its efficiency remains low because it performs many cache accesses. To date, much of the research work has been devoted to the development of efficient inverted indexes (Babenko and Lempitsky, 2015a; Xia et al., 2013), which reduce the number of ADCs required to answer NN queries. Recently, there also has been an interest in increasing the performance of the ADC procedure itself with the introduction of PQ Fast Scan (André et al., 2015). Unfortunately, PQ Fast Scan cannot be combined with efficient inverted indexes, limiting its usefulness in practical cases. In this paper, we introduce Quick ADC, a high-performance ADC procedure that can be combined with inverted indexes. More specifically, this paper makes two contributions, detailed in the next two paragraphs.

First, we detail the design of Quick ADC. Like PQ Fast Scan, Quick ADC replaces cache accesses by SIMD in-register shuffles to accelerate the ADC procedure. Exploiting SIMD in-register shuffles requires storing the lookup tables used by the ADC procedure in SIMD registers. However, these registers are much smaller than the lookup tables used by the conventional ADC procedure. Therefore, algorithmic changes are required to obtain small lookup tables that fit SIMD registers. PQ Fast Scan obtains such small lookup tables by grouping the codes of the database. This approach prevents PQ Fast Scan from being combined with inverted indexes. Quick ADC takes a different approach to obtain small lookup tables, which is compatible with inverted indexes. Namely, Quick ADC relies on two key ideas: (i) the use of 4-bit sub-quantizers, instead of the standard 8-bit sub-quantizers, and (ii) the quantization of floating-point distances to 8-bit integers.

Second, we implement Quick ADC and evaluate its performance in a wide range of scenarios. It is known that the use of 4-bit quantizers instead of the common 8-bit quantizers can cause a loss of recall (Jégou et al., 2011). However, we show that this loss is small or negligible, especially when combining Quick ADC with inverted indexes and Optimized Product Quantization (OPQ), a variant of PQ. On the SIFT1B dataset, Quick ADC achieves a better speed-accuracy tradeoff than the state-of-the-art OMulti-D-OADC system (Ge et al., 2014; Babenko and Lempitsky, 2015a), e.g., Quick ADC achieves a Recall@100 of 0.94 in 3.4 ms (128-bit codes).

2. Background

In this section, we describe how Product Quantizers (PQ) and Optimized Product Quantizers (OPQ) encode vectors into short codes. We then detail the ANN search process in databases of short codes. Lastly, we analyze the impact of PQ parameters on ANN search speed and recall.

2.1. Vector Encoding

Vector Quantizers. To encode vectors as short codes, PQ builds on vector quantizers. A vector quantizer, or quantizer, is a function $\operatorname{q}$ which maps a vector $x\in\mathbb{R}^{d}$ to a vector $c_{i}\in\mathbb{R}^{d}$ belonging to a predefined set of vectors $\mathcal{C}$. Vectors $c_{i}$ are called centroids, and the set of centroids $\mathcal{C}$, of cardinality $k$, is the codebook. For a given codebook $\mathcal{C}$, a quantizer which minimizes the quantization error must satisfy Lloyd’s condition and map the vector $x$ to its closest centroid $c_{i}$:

$$\operatorname{q}(x)=\operatorname*{arg\,min}_{c_{i}\in\mathcal{C}}\lVert x-c_{i}\rVert.$$

A vector quantizer can be used to encode a vector $x\in\mathbb{R}^{d}$ into a short code $i\in\{0\dots k-1\}$ using the encoder $\operatorname{enc}$ :

$$\operatorname{enc}(x)=i,\text{ such that }\operatorname{q}(x)=c_{i}$$

The short code $i$ occupies only $b=\lceil\log_{2}(k)\rceil$ bits, which is typically much less than the $d\cdot 32$ bits occupied by a vector $x\in\mathbb{R}^{d}$ stored as an array of $d$ single-precision floats (32 bits each). To keep the quantization error low enough for ANN search, a very large codebook, e.g., $k=2^{64}$ or $k=2^{128}$, is required. However, training such codebooks is intractable in terms of both processing and memory requirements.

Product Quantizers. Product quantizers overcome this issue by dividing a vector $x\in\mathbb{R}^{d}$ into $m$ sub-vectors, $x=(x^{0},\dots,x^{m-1})$, assuming that $d$ is a multiple of $m$. Each sub-vector $x^{j}\in\mathbb{R}^{d/m}$, $j\in\{0,\dots,m-1\}$, is quantized using a sub-quantizer $\operatorname{q}^{j}$. Each sub-quantizer $\operatorname{q}^{j}$ has a distinct codebook $\mathcal{C}^{j}=(c_{i}^{j})_{i=0}^{k-1}$ of cardinality $k$. A product quantizer $\operatorname{pq}$ maps a vector $x\in\mathbb{R}^{d}$ as follows:

$$\operatorname{pq}(x)=(c_{i_{0}}^{0},\dots,c_{i_{m-1}}^{m-1})$$

The codebook $\mathcal{C}$ of the product quantizer $\operatorname{pq}$ is given by the Cartesian product of the sub-quantizer codebooks:

$$\mathcal{C}=\mathcal{C}^{0}\times\dots\times\mathcal{C}^{m-1}$$

The cardinality of the product quantizer codebook $\mathcal{C}$ is $k^{m}$. Thus, a product quantizer is able to produce a large number of centroids, $k^{m}$, while only requiring storing and training $m$ codebooks of cardinality $k$. A product quantizer can be used to encode a vector $x$ into a short code by concatenating the codes produced by its sub-quantizers:

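$$\operatorname{enc}(x)=(i_{0},\dots,i_{m-1}),\text{ such that }\operatorname{pq}(x)=(c_{i_{0}}^{0},\dots,c_{i_{m-1}}^{m-1})$$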

The short code $(i_{0},\dots,i_{m-1})$ requires $\lceil\log_{2}(k^{m})\rceil=m\cdot b$ bits of storage, where $b=\lceil\log_{2}(k)\rceil$.
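
The encoding described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the codebooks are given as nested lists rather than learned, and the helper names (`squared_l2`, `nearest_centroid`, `pq_encode`) are hypothetical and do not come from the released implementation.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_centroid(sub_vector, codebook):
    """Index i of the centroid of `codebook` closest to `sub_vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_l2(sub_vector, codebook[i]))


def pq_encode(x, codebooks):
    """Encode x as (i_0, ..., i_{m-1}): one centroid index per sub-quantizer."""
    m = len(codebooks)
    if len(x) % m != 0:
        raise ValueError("d must be a multiple of m")
    sub_dim = len(x) // m
    return tuple(
        nearest_centroid(x[j * sub_dim:(j + 1) * sub_dim], codebooks[j])
        for j in range(m)
    )


# Toy setting: d = 4, m = 2 sub-quantizers, k = 2 centroids each (b = 1 bit).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook C^0
    [[0.0, 1.0], [1.0, 0.0]],  # codebook C^1
]
code = pq_encode([0.9, 0.8, 0.1, 0.7], codebooks)
```

In this toy setting the code occupies $m\cdot b=2$ bits, while the product codebook implicitly contains $k^{m}=4$ centroids.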

Optimized Product Quantizers. Cartesian k-means (CKM) (Norouzi and Fleet, 2013) and Optimized Product Quantizers (OPQ) (Ge et al., 2014) optimize the sub-space decomposition by multiplying the vector $x$ by an orthonormal matrix $R\in\mathbb{R}^{d\times d}$ before quantization. The matrix $R$ allows for arbitrary rotation and permutation of vector components. An optimized product quantizer $\operatorname{opq}$ maps a vector $x$ as follows:

$$\operatorname{opq}(x)=\operatorname{pq}(Rx),\text{ such that }R^{T}R=I,$$

where $\operatorname{pq}$ is a product quantizer. Optimized product quantizers can be used to encode vectors into short codes like product quantizers.

2.2. Inverted Indexes

The simplest search strategy, exhaustive search, involves encoding database vectors as short codes using PQ or OPQ and storing short codes in RAM. At query time, the whole database is scanned for nearest neighbors.

The more refined non-exhaustive search strategy relies on inverted indexes (or IVF) (Jégou et al., 2011; Jégou et al., 2011) to avoid scanning the whole database. An inverted index uses a quantizer $\operatorname{q_{i}}$ to partition the input vector space into $K$ Voronoi cells. Vectors lying in each cell are stored in an inverted list. At query time, the inverted index is used to find the closest cells to the query vector, which are then scanned. Inverted indexes therefore offer a lower query response time. When adding a vector $x$ to an indexed database, its residual $\operatorname{r}(x)$ is first computed:

$$\operatorname{r}(x)=x-\operatorname{q_{i}}(x)$$

The residual $\operatorname{r}(x)$ is then encoded into a short code using a product quantizer. This code is then stored in the appropriate inverted list of the inverted index. Indexed databases therefore use two quantizers: a quantizer for the index ($\operatorname{q_{i}}$) and a product quantizer to encode residuals into short codes. The energy of residuals $\operatorname{r}(x)$ is smaller than the energy of input vectors $x$, so encoding residuals into short codes yields a lower quantization error. Non-exhaustive search therefore offers a higher recall than exhaustive search in addition to the lower response time. Inverted indexes, however, incur a memory overhead (usually 4 bytes per database vector). This overhead is negligible for small databases ($\sim 4$ MB for 1 million vectors), and for large databases exhaustive search is hardly tractable anyway. Non-exhaustive search is therefore preferred to exhaustive search in most cases.

2.3. ANN Search

ANN search in a database of short codes consists of three steps: Index, which retrieves inverted lists from the index; Tables, which computes lookup tables to speed up distance computations; and Scan, which computes distances between the query vector and short codes using the pre-computed lookup tables. The Index step is only required for non-exhaustive search and is skipped in the case of exhaustive search. We detail these three steps in the following three paragraphs.

Index. In this step, the Voronoi cell of the inverted index quantizer $\operatorname{q_{i}}$ in which the query vector $y$ lies is determined. The residual $\operatorname{r}(y)$ of the query vector is also computed. In practice, to improve recall, the $\mathit{ma}$ closest cells (typically, $\mathit{ma}=8$ to $64$ ) are selected. For the sake of simplicity, this section describes the ANN search process for $\mathit{ma}=1$ , but each operation is repeated $\mathit{ma}$ times: $\mathit{ma}$ cells are selected, $\mathit{ma}$ sets of lookup tables are computed and $\mathit{ma}$ cells are searched. In the case of exhaustive search no residual is computed and the query vector is used as-is. In the remainder of this section, $y^{\prime}=\operatorname{r}(y)$ for non-exhaustive search, and $y^{\prime}=y$ for exhaustive search.
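
As an illustration of the Index step, the sketch below selects the $\mathit{ma}$ closest cells by brute force and computes one residual per selected cell. It assumes the index quantizer is given as a plain list of cell centroids; the names `select_cells` and `residual` are hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_cells(y, index_centroids, ma):
    """Ids of the ma index centroids closest to the query y, nearest first."""
    order = sorted(range(len(index_centroids)),
                   key=lambda cell: squared_l2(y, index_centroids[cell]))
    return order[:ma]


def residual(y, centroid):
    """Residual r(y) = y - q_i(y) with respect to one cell centroid."""
    return [a - b for a, b in zip(y, centroid)]


# Toy index with K = 4 cells in a 2-dimensional space.
index_centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
y = [3.0, 0.5]

cells = select_cells(y, index_centroids, ma=2)
# One residual per selected cell; each one feeds its own set of lookup tables.
residuals = [residual(y, index_centroids[cell]) for cell in cells]
```

Because a different residual is obtained for each selected cell, the Tables step has to be repeated $\mathit{ma}$ times, as stated above.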

Tables. In this step, a set of $m$ lookup tables $\{D^{j}\}_{j=0}^{m-1}$ is computed, where $m$ is the number of sub-quantizers of the product quantizer. The $j$th lookup table comprises the distances between the $j$th sub-vector of $y^{\prime}$ and all centroids of the $j$th sub-quantizer:

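$$D^{j}=\left(\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[0]\rVert^{2},\dots,\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[k-1]\rVert^{2}\right)$$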

Scan. In this step, the cells of the inverted index selected during the step Index are searched for nearest neighbors. This requires computing the distance between the query vectors and short codes using Asymmetric Distance Computation (ADC). ADC computes the distance between the query vector $y$ and a short code $c$ as follows:

$$\operatorname{adc}(y,c)=\sum_{j=0}^{m-1}D^{j}[c[j]]\tag{2}$$

Equation 2 is equivalent to:

$$\operatorname{adc}(y,c)=\sum_{j=0}^{m-1}\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[c[j]]\rVert^{2}$$

Thus, ADC computes the distance between a query vector $y^{\prime}$ and a code $c$ by summing the distances between the sub-vectors of $y^{\prime}$ and the centroids associated with code $c$ in the $m$ sub-spaces of the product quantizer. When the number of codes in cells is large compared to $k$, the number of centroids of sub-quantizers, using lookup tables avoids computing $\lVert{y^{\prime}}^{j}-\mathcal{C}^{j}[i]\rVert^{2}$ for the same $i$ multiple times. Lookup tables therefore provide a significant speedup. While scanning inverted lists, neighbors and their associated distances are stored in a binary heap of size $R$ (Algorithm 1, line 6).
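
The Tables and Scan steps of conventional ADC can be sketched as follows. This is a scalar illustration with toy codebooks, not the procedure of Algorithm 1: the function names are hypothetical, and `heapq.nsmallest` stands in for the explicit binary heap of size $R$.

```python
import heapq


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def compute_tables(y_prime, codebooks):
    """Tables step: D^j[i] = squared distance from sub-vector j to centroid i."""
    m = len(codebooks)
    sub_dim = len(y_prime) // m
    return [
        [squared_l2(y_prime[j * sub_dim:(j + 1) * sub_dim], centroid)
         for centroid in codebooks[j]]
        for j in range(m)
    ]


def adc(tables, code):
    """Asymmetric distance: m table lookups and m additions."""
    return sum(tables[j][code[j]] for j in range(len(tables)))


def scan(tables, codes, r):
    """Scan step: the r (distance, position) pairs with the smallest distance."""
    return heapq.nsmallest(
        r, ((adc(tables, code), pos) for pos, code in enumerate(codes)))


# Toy setting: d = 4, m = 2, k = 2.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # codebook C^0
    [[0.0, 1.0], [1.0, 0.0]],  # codebook C^1
]
tables = compute_tables([1.0, 1.0, 0.0, 1.0], codebooks)
neighbors = scan(tables, [(0, 0), (1, 0), (1, 1)], r=2)
```

The tables are computed once per query (and per selected cell), after which each code costs only $m$ lookups and $m$ additions, regardless of the dimensionality $d$.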

2.4. Impact of PQ Parameters

The two parameters of a product quantizer, $m$ , the number of sub-quantizers and $k$ , the number of centroids of each sub-quantizer impact: (1) the memory usage of codes, (2) the recall of ANN search and (3) search speed. In practice, 64-bit codes ( $2^{64}$ centroids) or 128-bit codes ( $2^{128}$ centroids) are used in most cases.

The second tradeoff is between ANN accuracy and search speed. For a constant memory budget of $m\cdot b$ bits per code, the respective values of $m$ and $b$ impact accuracy and speed. Decreasing $m$, which implies increasing $b$, increases accuracy (Jégou et al., 2011). We now discuss the effect of $m$ and $b$ on the time cost of the Tables and Scan steps of ANN search (Section 2.3). Each lookup table requires $k=2^{b}$ $l_{2}$-norm computations in sub-spaces of dimensionality $d/m$. Thus, the complexity of computing all $m$ lookup tables is $O(m\cdot 2^{b}\cdot d/m)=O(2^{b}\cdot d)$, and increases exponentially with $b$. In conclusion, decreasing $m$ makes the Tables step more costly.

During the Scan step, each Asymmetric Distance Computation (ADC) (Algorithm 1, line 12) requires $m$ accesses to lookup tables and $m$ additions (Algorithm 1, line 15). Therefore, decreasing $m$ decreases the number of operations required for each ADC, which is beneficial for search speed. However, decreasing $m$ implies increasing $b$, and thus increasing the size of lookup tables. The size of all lookup tables $\{D^{j}\}_{j=0}^{m-1}$ is $m\cdot k\cdot\operatorname{sizeof}(\mathrm{float})=m\cdot 2^{b}\cdot 4$ bytes. It increases linearly with $m$ and exponentially with $b$. Thus, for a constant code size, decreasing $m$ increases the size of lookup tables. As the size of lookup tables increases, they need to be stored in larger and slower cache levels, which is detrimental to performance (André et al., 2015). In conclusion, decreasing $m$ makes the Scan step less costly, unless it causes lookup tables to be stored in a slower cache level.

To illustrate this, we measure the recall (R@100) and the time cost of the Tables and Scan steps of ANN search for different $m{{\times}}b$ configurations producing 64-bit codes (Table 1). For $16{{\times}}4$ and $8{{\times}}8$, tables fit the L1 cache. The $8{{\times}}8$ configuration has a lower Scan time because it requires fewer additions and fewer accesses to lookup tables. The $4{{\times}}16$ configuration requires even fewer additions and table accesses, but its lookup tables are stored in the much slower L3 cache. Overall, the $4{{\times}}16$ configuration therefore has a higher Scan time. In all cases, the time cost of the Tables step increases with $b$.
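
The table sizes reported in Table 1 follow directly from the formula $m\cdot 2^{b}\cdot\operatorname{sizeof}(\mathrm{float})$. The short sketch below reproduces this arithmetic for the three 64-bit configurations; the function name `lookup_tables_bytes` is hypothetical.

```python
def lookup_tables_bytes(m, b, float_bytes=4):
    """Total size in bytes of m lookup tables of k = 2**b floats each."""
    return m * (2 ** b) * float_bytes


# The three m x b configurations of Table 1, all producing 64-bit codes.
configs = [(16, 4), (8, 8), (4, 16)]
sizes = {(m, b): lookup_tables_bytes(m, b) for m, b in configs}

for (m, b), size in sizes.items():
    # Every configuration uses m * b = 64 bits per code.
    print(f"{m}x{b}: {m * b}-bit codes, {size / 1024:g} KiB of lookup tables")
```

The sizes are 1 KiB, 8 KiB and 1024 KiB (1 MiB), matching the Size column of Table 1.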

3. Quick ADC

3.1. Overview

The performance gains of Quick ADC are achieved by exploiting SIMD. Single Instruction Multiple Data (SIMD) instructions perform the same operation e.g., additions, on multiple data elements in one instruction. Consequently, SIMD enables large performance improvements. Thus, optimized linear algebra libraries rely on SIMD to offer high performance. Current CPUs include an SIMD unit in each core. SIMD therefore offers an additional level of parallelism over multi-core processing. ANN search parallelizes naturally over multiple cores by processing a distinct query on each core. With Quick ADC, we propose further increasing performance by speeding up ADC for each query, thanks to the use of SIMD. To process multiple data elements at once, SIMD instructions operate on wide registers. SSE instructions use 128-bit registers, while the newer AVX instructions use 256-bit registers.

The Scan step computes asymmetric distances between the query vector and all codes stored in selected cells. Each ADC requires (1) $m$ accesses to cache-resident lookup tables and (2) $m$ additions. While implementing additions using SIMD is straightforward, SIMD does not allow an efficient implementation of table lookups, even using the gather instructions introduced in recent processors (André et al., 2015; Hofmann et al., 2014). Although SIMD can add 4 floating-point numbers (128 bits) or 8 floating-point numbers (256 bits) at once, there are only 2 cache read ports in each CPU core. Therefore, it is not possible to perform more than 2 cache accesses concurrently.

Therefore, efficiently implementing ADC using SIMD requires storing lookup tables in SIMD registers and performing lookups using SIMD in-register shuffles. The main challenge is that SIMD registers (128 bits) are much smaller than lookup tables, for common PQ configurations. In most cases, product quantizers use 8-bit sub-quantizers, which results in lookup tables of $k=2^{8}=256$ floats (8192 bits). For this reason, Quick ADC relies on (i) the use of 4-bit quantizers instead of the common 8-bit quantizers, and (ii) the quantization of floats to 8-bit integers. We obtain lookup tables of $k=2^{4}=16$ floats, which are then quantized to 8-bit integers. The resulting lookup tables comprise 16 8-bit integers (128 bits), and can be stored in SIMD registers. Once lookup tables are stored in SIMD registers, in-register shuffles can be used to perform 16 lookups in 1 cycle (Figure 1), enabling large performance gains.

In addition to the use of 4-bit quantizers and the quantization of floats to 8-bit integers, Quick ADC requires a minor change of memory layout. In the next sections, we detail this change of memory layout as well as our lookup tables quantization process and the SIMD implementation of distance computations.

3.2. Memory Layout

A SIMD in-register shuffle performs 16 lookups at once, but in a single lookup table, e.g., $D^{0}$ (Figure 1). Therefore, to use shuffles efficiently, we need to operate on the first component of 16 codes ($a_{0},\dotsc,p_{0}$) at once instead of the 16 components of a single code ($a_{0},\dotsc,a_{15}$). It is crucial for efficiency that all values in a SIMD register can be loaded in a single memory read. This requires that $a_{0},\dotsc,p_{0}$ are contiguous in memory, which is not the case with the standard memory layout of inverted lists (Figure 2a). We therefore transpose inverted lists by blocks of 16 codes, so that analogous components of 16 codes are stored in adjacent bytes (Figure 2b). We divide each inverted list in blocks of 16 codes and transpose each block independently. Figure 2 shows the transposition of one block of 16 codes ($a,\dotsc,p$). This transposition is performed offline, and does not increase ANN query response time. The transposition is moreover very fast; the overhead on database creation time is less than 1%.
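
The block transposition can be sketched as follows. The sketch assumes the nibble packing described in Section 3.4 (two 4-bit components per byte, with the even-numbered component in the low 4 bits) and handles exactly one full block of 16 codes; the function name `transpose_block` is hypothetical.

```python
BLOCK_SIZE = 16  # number of codes per block, one per byte of a 128-bit register


def transpose_block(codes):
    """Transpose one block of 16 codes into m/2 rows of 16 bytes.

    Each code is a sequence of m 4-bit components (m even). Byte n of row j
    stores component 2j of code n in its low 4 bits and component 2j+1 of
    code n in its high 4 bits.
    """
    if len(codes) != BLOCK_SIZE:
        raise ValueError("a block must contain exactly 16 codes")
    m = len(codes[0])
    if m % 2 != 0:
        raise ValueError("m must be even")
    if any(not 0 <= comp <= 15 for code in codes for comp in code):
        raise ValueError("components must fit in 4 bits")
    return [
        bytes(code[j] | (code[j + 1] << 4) for code in codes)
        for j in range(0, m, 2)
    ]


# Toy block: m = 4, code n has components (n, n+1, n+2, n+3) modulo 16.
codes = [[(n + j) % 16 for j in range(4)] for n in range(BLOCK_SIZE)]
rows = transpose_block(codes)
```

After transposition, the first components of all 16 codes sit in the low 4 bits of 16 adjacent bytes, so a single 128-bit load brings them into one register.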

3.3. Quantization of Lookup Tables

In standard ADC, lookup tables store 32-bit floats. To be able to store tables of 16 elements in 128-bit registers, we quantize 32-bit floats to 8-bit integers using a scalar quantizer. Because there is no SIMD instruction to compare unsigned 8-bit integers, we quantize distances to signed 8-bit integers, only using their positive range. We quantize distances between a $\mathit{qmin}$ and $\mathit{qmax}$ bound into $n=127$ bins (0-126) uniformly. The size of each bin is $\Delta=(\mathit{qmax}-\mathit{qmin})/n$ . Values larger than $\mathit{qmax}$ are quantized to 127.

We choose the minimum value across all lookup tables $\{D^{j}\}_{j=0}^{m-1}$, which is the smallest distance we need to represent, as the $\mathit{qmin}$ value. Using the maximum possible distance, i.e., the sum of the maximums of all lookup tables, as the $\mathit{qmax}$ bound would result in too high a quantization error. Therefore, to set $\mathit{qmax}$, we scan $\mathit{init}$ vectors (typically $\mathit{init}=200$ to $1000$) to find a temporary set of $R$ nearest neighbor candidates, where $R$ is the number of nearest neighbors requested by the user (Section 2.3). We use the distance of the query vector to the $R$th nearest neighbor candidate, i.e., the farthest nearest neighbor candidate, as the $\mathit{qmax}$ bound. All subsequent candidates will need to be closer to the query vector, thus $\mathit{qmax}$ is the maximum distance we need to represent.
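
The scalar quantizer of this section can be sketched as below. The text does not specify how bin boundaries are rounded, so rounding down to the bin index is an assumption of this sketch, as are the function names; $\mathit{qmax}$ is passed in directly instead of being obtained by scanning $\mathit{init}$ vectors.

```python
def quantize_distance(value, qmin, qmax, n=127):
    """Map a float distance to an integer in 0..127.

    Distances in [qmin, qmax] fall into n uniform bins numbered 0..n-1;
    distances larger than qmax are mapped to n (127).
    """
    if qmax <= qmin:
        raise ValueError("qmax must be larger than qmin")
    if value > qmax:
        return n
    delta = (qmax - qmin) / n          # size of one bin
    index = int((value - qmin) / delta)  # assumption: round down
    return max(0, min(index, n - 1))


def quantize_tables(tables, qmax, n=127):
    """Quantize all lookup tables, using their global minimum as qmin."""
    qmin = min(min(table) for table in tables)
    return [[quantize_distance(v, qmin, qmax, n) for v in table]
            for table in tables]


# Toy tables chosen so that the bin size is exactly 1.0.
quantized = quantize_tables([[0.0, 63.5], [127.0, 200.0]], qmax=127.0)
```

Every quantized value fits in the positive range of a signed 8-bit integer, so a table of 16 entries occupies exactly 128 bits.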

3.4. SIMD Distance Computation

Although recent Intel CPUs offer 256-bit SIMD, we describe a version of Quick ADC which uses 128-bit SIMD for the sake of simplicity. We explain how to generalize it to 256-bit SIMD at the end of the section. Moreover, the 128-bit version of Quick ADC offers the best compatibility, notably with older Intel CPUs or ARM CPUs. In Algorithm 2, SIMD instructions are denoted by the prefix simd_. SIMD instructions use 128-bit variables, denoted by r128.

The quick_adc_scan function (Algorithm 2, line 12) scans a block-transposed inverted list $\mathit{tlist}$ (Section 3.2) using $m$ quantized lookup tables $\{D^{j}\}_{j=0}^{m-1}$ , where $m$ is the number of sub-quantizers of the product quantizer. Each lookup table is stored in a distinct SIMD register. The quick_adc_scan function iterates over blocks $\mathit{blk}$ of 16 codes (Algorithm 2, line 14). The quick_adc_block function computes the distance between the query vector and the 16 codes ( $a,\dotsc,p$ ) of the block $\mathit{blk}$ .

Each block comprises $m/2$ rows of 16 bytes (128 bits). Row $j$ stores the $(2j)$th and $(2j+1)$th components of 16 codes (Figure 2b). The quick_adc_block function iterates over each row (Algorithm 2, line 7), and loads it in the $\mathit{comps}$ register sequentially (Algorithm 2, line 8). Two lookup-add operations are performed on each row (Algorithm 2, line 9 and line 11): one for the $(2j)$th components, and one for the $(2j+1)$th components of the codes. Figure 3 describes the succession of operations performed by the lookup_add function for the first row ($j=0$). As each byte of the first row stores two components, e.g., the first byte of the first row stores $a_{1}$ and $a_{0}$ (Figure 3), we start by keeping only the lower 4 bits of each byte (bitwise AND with 0x0f), to obtain the first components ($a_{0},\dotsc,p_{0}$) only. The remainder of the function looks up values in the $D^{0}$ table and accumulates distances in the $\mathit{acc}$ variable. Before the lookup_add function can be used to process the second components ($a_{1},\dotsc,p_{1}$), it is necessary that ($a_{1},\dotsc,p_{1}$) are in the lowest 4 bits of each byte of the register. We therefore right shift the $\mathit{comps}$ register by 4 bits (Figure 4) before calling lookup_add (Algorithm 2, line 10). The extract_matches function (Algorithm 2, line 16), the implementation of which is not shown, extracts distances from the $\mathit{acc}$ register and inserts them in the binary heap $\mathit{neighbors}$.
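
The data flow of quick_adc_block and lookup_add can be emulated with scalar Python code, where a list of 16 integers stands in for a 128-bit register. This is an emulation for illustration only, not the SIMD implementation: the `shuffle` helper mimics a 16-way in-register lookup, and the saturating addition at 127 is an assumption of this sketch (the text above does not say how accumulator overflow is handled).

```python
def shuffle(table, indexes):
    """Scalar stand-in for an in-register shuffle: 16 lookups in one table."""
    return [table[i] for i in indexes]


def lookup_add(comps, table, acc):
    """Mask the low 4 bits of each byte, look up distances, accumulate."""
    masked = [byte & 0x0F for byte in comps]
    partial = shuffle(table, masked)
    # Assumption: saturate at 127, the largest signed 8-bit value.
    return [min(a + p, 127) for a, p in zip(acc, partial)]


def quick_adc_block(rows, tables):
    """Distances between the query and the 16 codes of a transposed block.

    rows:   m/2 rows of 16 bytes, two 4-bit components per byte.
    tables: m quantized lookup tables of 16 small integers each.
    """
    acc = [0] * 16
    for j, row in enumerate(rows):
        acc = lookup_add(row, tables[2 * j], acc)          # components 2j
        shifted = [byte >> 4 for byte in row]              # right shift by 4
        acc = lookup_add(shifted, tables[2 * j + 1], acc)  # components 2j+1
    return acc


# Toy block with m = 2: code n has components (n, 15 - n).
rows = [bytes(n | ((15 - n) << 4) for n in range(16))]
tables = [list(range(16)), [2 * i for i in range(16)]]
distances = quick_adc_block(rows, tables)
```

Here code $n$ has distance $n+2(15-n)=30-n$, and all 16 distances are produced by two lookup-add operations on a single row.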

Among 256-bit SIMD instructions (AVX and AVX2 instruction sets) supported on recent CPUs, some, like in-register shuffles, operate concurrently on two independent 128-bit lanes. This prevents use of 256-bit lookup tables (32 8-bit integers) but allows an easy generalization of the 128-bit version of Quick ADC. While the 128-bit version of Quick ADC iterates on block rows one by one (Algorithm 2, line 7), the 256-bit version processes two rows at once: one row in each 128-bit lane. The number of iterations is thus reduced from $m/2$ to $m/4$ . Lastly, instead of storing each $D^{j}$ table in a distinct 128-bit register, the tables $D^{j}$ and $D^{2j}$ , $j\in\{0,\dotsc,m/2-1\}$ , are stored in each of the two lanes of a 256-bit register.

4. Evaluation

4.1. Experimental Setup

We implemented 256-bit Quick ADC in C++, using compiler intrinsics to access SIMD instructions. Our implementation is released under the Clear BSD license (https://github.com/technicolor-research/quick-adc) and uses the AVX and AVX2 instruction sets. We used the g++ compiler version 5.3, with the options -O3 -ffast-math -m64 -march=native. Exhaustive search and non-exhaustive search (inverted indexes, IVF) were implemented as described in (Jégou et al., 2011). We use the yael library and the ATLAS library version 3.10.2. We compiled an optimized version of ATLAS on our system. To learn product quantizers and optimized product quantizers, we used the implementation (https://github.com/arbabenko/Quantizations) of the authors of (Babenko and Lempitsky, 2015b, 2014). Unless otherwise noted, experiments were performed on our workstation (Table 2). To get accurate timings, we processed queries sequentially on a single core. We evaluate our approach on two publicly available datasets of SIFT descriptors (http://corpus-texmex.irisa.fr/), one dataset of GIST descriptors, and one dataset of PCA-compressed deep features (http://sites.skoltech.ru/compvision/projects/aqtq/) (Table 3). For SIFT1B, the learning set is needlessly large to train product quantizers, so we used the first 2 million vectors. We used a query set of 1000 vectors for all experiments.

4.2. Exhaustive Search in SIFT1M

Using $16{{\times}}4$ Quick ADC (QADC) instead of $8{{\times}}8$ ADC offers a large performance gain, thanks to the use of SIMD in-register shuffles. However, it also causes a decrease in recall, which is caused by two factors: (1) the use of $16{{\times}}4$ quantizers instead of $8{{\times}}8$ quantizers (Section 2.4) and (2) the use of quantized lookup tables (Section 3.3). In this section, we evaluate the global decrease in recall caused by the use of $16{{\times}}4$ QADC instead of $8{{\times}}8$ ADC, but also the relative impact of factors (1) and (2). To do so, we use the SIFT1M dataset and follow an exhaustive search strategy. We do not use an inverted index and we encode the original vectors into short codes, not residuals. This maximizes quantization error and thus represents a worst-case scenario for QADC. We scan $\mathit{init}=200$ vectors to set the $\mathit{qmax}$ bound for quantization of lookup tables (Section 3.3).

We observe that $16{{\times}}4$ ADC slightly decreases recall (Figure 5a). However, $16{{\times}}4$ QADC, which uses quantized lookup tables, does not further decrease recall in comparison with $16{{\times}}4$ ADC. OPQ yields better results than PQ in all cases (Figure 5b), which is consistent with (Norouzi and Fleet, 2013; Ge et al., 2014). Moreover, the difference in recall between $8{{\times}}8$ ADC and $16{{\times}}4$ QADC is lower for OPQ than it is for PQ. OPQ optimizes the decomposition of the input vector space into $m$ sub-spaces, which are used by the optimized product quantizer (Section 2.1). For $m=16$ , OPQ has more degrees of freedom than for $m=8$ and is therefore able to bring a greater level of optimization.

For an exhaustive search in 1 million vectors, $16{{\times}}4$ QADC is ${\sim}14$ times faster than $16{{\times}}4$ ADC and $6$ times faster than $8{{\times}}8$ ADC (an 85% decrease in response time; Figure 5c). Response times for PQ and OPQ are similar, so we report results for PQ. In practice, $8{{\times}}8$ ADC is much more common than $16{{\times}}4$ ADC (Babenko and Lempitsky, 2014, 2015b, 2015a; Norouzi and Fleet, 2013; Zhang et al., 2014), so we only compare $16{{\times}}4$ QADC with $8{{\times}}8$ ADC in the remainder of this section. Overall, QADC trades a small decrease in recall for a large improvement in response time.
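
As a point of comparison for these timings, the sketch below shows a scalar (non-SIMD) exhaustive scan with $16{{\times}}4$ codes: each distance is the sum of 16 table lookups, one per sub-quantizer. It is not the QADC kernel, which replaces these memory lookups with in-register shuffles. The names `Tables`, `Code`, `adc_distance` and `scan_nearest`, as well as the packing of two 4-bit indexes per byte (low nibble first), are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kM = 16;          // number of sub-quantizers
constexpr std::size_t kCentroids = 16;  // 2^4 centroids per sub-quantizer

// One lookup table per sub-quantizer, computed once per query.
using Tables = std::array<std::array<float, kCentroids>, kM>;
// A 64-bit code: 16 indexes of 4 bits, packed two per byte (assumed layout).
using Code = std::array<std::uint8_t, kM / 2>;

// Asymmetric distance between the query (via its tables) and one code.
float adc_distance(const Tables& tables, const Code& code) {
    float dist = 0.0f;
    for (std::size_t j = 0; j < kM; ++j) {
        const std::uint8_t byte = code[j / 2];
        const std::uint8_t idx = (j % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        dist += tables[j][idx];
    }
    return dist;
}

// Exhaustive scan: returns the position of the code with the lowest distance.
std::size_t scan_nearest(const Tables& tables, const std::vector<Code>& codes) {
    assert(!codes.empty());
    std::size_t best = 0;
    float best_dist = adc_distance(tables, codes[0]);
    for (std::size_t i = 1; i < codes.size(); ++i) {
        const float d = adc_distance(tables, codes[i]);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```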

Non-exhaustive search offers both a lower response time and a higher recall than exhaustive search (Section 2.2). For this reason, non-exhaustive search is preferred to exhaustive search in practical systems. Therefore, in the remainder of this section, we evaluate QADC in the context of non-exhaustive search, for a wide range of scenarios: SIFT and GIST descriptors, deep features, PQ and OPQ, 64- and 128-bit codes. We show that in most cases, when combined with OPQ and inverted indexes, QADC offers a decrease in response time close to 70% for a small or negligible loss of accuracy.

4.3. Non-exhaustive Search in SIFT1M

Table 4.3 compares the Recall@100 (R@100) and total ANN search time (Total). The time spent in each of the search steps (Index, Tables, and Scan) detailed in Section 1 is also reported. All times are in milliseconds (ms). OPQ requires a rotation of the input vector before computing lookup tables (Section 2.1). We include the time to perform this rotation in the Tables column. When using inverted indexes, the parameters $K$ , the total number of cells of the inverted index, and $\mathit{ma}$ , the number of cells scanned to answer a query, impact response time and recall (Section 1). For datasets of 1 million vectors, we have found the parameters $\mathit{ma}=24$ and $K=256$ to offer the best tradeoff.
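
The role of $K$ and $\mathit{ma}$ in the Index step can be sketched as follows: among the $K$ cell centroids of the inverted index, the $\mathit{ma}$ centroids closest to the query are selected, and only the corresponding cells are scanned. The function names `squared_l2` and `select_cells` and the data layout are hypothetical, and this is a simplified illustration rather than the implementation evaluated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Squared Euclidean distance between two vectors of equal dimension.
float squared_l2(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Returns the identifiers of the ma cells whose centroids are closest to the
// query, ordered by increasing distance. centroids.size() plays the role of K.
std::vector<std::size_t> select_cells(
        const std::vector<std::vector<float>>& centroids,
        const std::vector<float>& query, std::size_t ma) {
    assert(ma <= centroids.size());
    std::vector<float> dist(centroids.size());
    for (std::size_t k = 0; k < centroids.size(); ++k) {
        dist[k] = squared_l2(query, centroids[k]);
    }
    std::vector<std::size_t> ids(centroids.size());
    std::iota(ids.begin(), ids.end(), std::size_t{0});
    // Only the ma smallest distances need to be ordered.
    std::partial_sort(ids.begin(),
                      ids.begin() + static_cast<std::ptrdiff_t>(ma),
                      ids.end(),
                      [&dist](std::size_t x, std::size_t y) {
                          return dist[x] < dist[y];
                      });
    ids.resize(ma);
    return ids;
}
```

Increasing `ma` scans more cells, which raises recall and response time together; this is the tradeoff that the parameters above balance.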

For this configuration, QADC offers a 75% decrease in scan time. In addition, QADC offers a 50-70% decrease in tables computation time, thanks to the use of 4-bit quantizers, which result in smaller lookup tables that are faster to compute. Overall, this translates into a decrease of approximately 70% in total response time. The loss of recall is significantly lower with OPQ (-1.5%) than with PQ (-4.4%), as OPQ offers a lower quantization error than PQ.

Bibliography

André et al. (2015) Fabien André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec. 2015. Cache locality is not enough: High-Performance Nearest Neighbor Search with Product Quantization Fast Scan. PVLDB 9, 4 (2015).
Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. 2014. Additive Quantization for Extreme Vector Compression. In CVPR.
Babenko and Lempitsky (2015a) Artem Babenko and Victor Lempitsky. 2015a. The Inverted Multi-Index. TPAMI 37, 6 (2015).
Babenko and Lempitsky (2015b) Artem Babenko and Victor Lempitsky. 2015b. Tree Quantization for Large-Scale Similarity Search and Classification. In CVPR.
Ge et al. (2014) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization. TPAMI 36, 4 (2014).
Hofmann et al. (2014) Johannes Hofmann, Jan Treibig, Georg Hager, and Gerhard Wellein. 2014. Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips. In WPMVP.
Jégou et al. (2011) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. TPAMI 33, 1 (2011).