Composite Code Sparse Autoencoders for first stage retrieval

Carlos Lassance; Thibault Formal; Stephane Clinchant

arXiv:2204.07023·cs.IR·April 15, 2022


TL;DR

This paper introduces the Composite Code Sparse Autoencoder (CCSA), a novel method for efficient approximate nearest neighbor search in document retrieval, combining sparse coding, binary quantization, and graph-based search techniques to improve speed and accuracy.

Contribution

The paper presents CCSA, a new autoencoder-based approach that enhances indexing and retrieval efficiency for dense vector representations in information retrieval systems.

Findings

01

CCSA outperforms IVF with product quantization on the MSMARCO dataset.

02

Binary quantization with CCSA reduces index size and memory usage.

03

CCSA surpasses recent supervised quantization methods in image retrieval.

Abstract

We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection, and the second stage re-ranks the candidate set by relying on more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement the traditional bag-of-words models. However, indexing and searching a large document collection requires efficient similarity search on dense vectors, which is why ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Second, CCSA can be used as a binary…
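As a rough illustration of why composite codes lend themselves to an inverted index, the sketch below assigns each vector one entry per codebook and stores a posting list for every (codebook, entry) pair. It is not the paper's implementation: the random codebooks stand in for a learned encoder, no uniformity regularizer is trained, and the names `composite_code`, `build_inverted_index`, and `search` are hypothetical. Candidates are scored simply by the number of codebooks on which they agree with the query.

```python
import random
from collections import defaultdict


def composite_code(vector, codebooks):
    """For each codebook, pick the entry with the largest dot product."""
    code = []
    for book in codebooks:
        sims = [sum(v * w for v, w in zip(vector, entry)) for entry in book]
        code.append(max(range(len(book)), key=sims.__getitem__))
    return code


def build_inverted_index(doc_codes):
    """Map each (codebook, entry) pair to the list of documents that use it."""
    index = defaultdict(list)
    for doc_id, code in enumerate(doc_codes):
        for book_id, entry_id in enumerate(code):
            index[(book_id, entry_id)].append(doc_id)
    return index


def search(query_code, index, n_docs, top_k):
    """Score documents by how many codebook entries they share with the query."""
    scores = [0] * n_docs
    for book_id, entry_id in enumerate(query_code):
        for doc_id in index.get((book_id, entry_id), []):
            scores[doc_id] += 1
    ranked = sorted(range(n_docs), key=lambda d: (-scores[d], d))
    return ranked[:top_k], scores


# Toy data: random vectors and random codebooks (assumption: stand-ins for
# Siamese-BERT document embeddings and a trained CCSA encoder).
rng = random.Random(0)
dim, n_codebooks, codebook_size, n_docs = 16, 4, 8, 200
codebooks = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
    for _ in range(n_codebooks)
]
docs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_docs)]

doc_codes = [composite_code(d, codebooks) for d in docs]
index = build_inverted_index(doc_codes)
top_ids, scores = search(doc_codes[3], index, n_docs, top_k=5)
```

Each document appears in exactly one posting list per codebook, so lookup cost depends on how evenly documents spread over the entries. Balanced posting lists are what the abstract's uniformity regularizer is said to encourage; with untrained random codebooks, as here, no such balance is guaranteed.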

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Sparse Autoencoder