Composite Code Sparse Autoencoders for first stage retrieval
Carlos Lassance, Thibault Formal, Stephane Clinchant

TL;DR
This paper introduces the Composite Code Sparse Autoencoder (CCSA), a novel method for efficient approximate nearest neighbor search in document retrieval, combining sparse coding, binary quantization, and graph-based search techniques to improve speed and accuracy.
Contribution
The paper presents CCSA, a new autoencoder-based approach that enhances indexing and retrieval efficiency for dense vector representations in information retrieval systems.
Findings
CCSA outperforms IVF with product quantization on the MSMARCO dataset.
Binary quantization with CCSA reduces index size and memory usage.
CCSA surpasses recent supervised quantization methods in image retrieval.
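To make the index-size finding concrete, the following back-of-the-envelope sketch compares the storage needed for dense float vectors with that of binary codes. All numbers here are assumptions chosen for illustration (a one-million-document collection, 768-dimensional float32 vectors, 256-bit codes); none of them are taken from the paper.

```python
def index_bytes(n_docs: int, bits_per_doc: int) -> int:
    """Raw storage for the document codes alone, ignoring any index overhead."""
    return n_docs * bits_per_doc // 8

# Hypothetical sizes, chosen only to illustrate the arithmetic.
n_docs = 1_000_000
dense_bits = 768 * 32   # assumed 768-d float32 vector per document
binary_bits = 256       # assumed 256-bit binary code per document

dense_size = index_bytes(n_docs, dense_bits)
binary_size = index_bytes(n_docs, binary_bits)
ratio = dense_size // binary_size

print(f"dense:  {dense_size / 1e9:.3f} GB")
print(f"binary: {binary_size / 1e6:.1f} MB")
print(f"ratio:  {ratio}x")
```

Under these assumed sizes the binary codes are 96 times smaller than the dense vectors; the actual saving depends on the code length and on the vector dimension in use.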
Abstract
We propose a Composite Code Sparse Autoencoder (CCSA) approach for Approximate Nearest Neighbor (ANN) search of document representations based on Siamese-BERT models. In Information Retrieval (IR), the ranking pipeline is generally decomposed into two stages: the first stage focuses on retrieving a candidate set from the whole collection, and the second stage re-ranks the candidate set by relying on more complex models. Recently, Siamese-BERT models have been used as first-stage rankers to replace or complement the traditional bag-of-words models. However, indexing and searching a large document collection requires efficient similarity search on dense vectors, which is why ANN techniques come into play. Since composite codes are naturally sparse, we first show how CCSA can learn an efficient parallel inverted index thanks to a uniformity regularizer. Second, CCSA can be used as a binary…
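The idea of a composite code backed by parallel inverted indexes can be sketched as follows. This is a minimal illustration, not the authors' implementation: a random projection stands in for the trained autoencoder, training and the uniformity regularizer are not shown, and the names `encode`, `build_index`, and `search`, as well as all sizes, are hypothetical. Each vector is mapped to one codeword id per chunk, each chunk gets its own inverted index, and a document is scored by the number of chunks in which it shares the query's codeword.

```python
from collections import defaultdict

import numpy as np


def encode(vectors, projection, n_chunks, n_codewords):
    """Map dense vectors to composite codes: one codeword id per chunk."""
    logits = vectors @ projection  # shape (n, n_chunks * n_codewords)
    logits = logits.reshape(len(vectors), n_chunks, n_codewords)
    return logits.argmax(axis=2)   # shape (n, n_chunks)


def build_index(codes):
    """Build one inverted index per chunk: codeword id -> list of doc ids."""
    index = [defaultdict(list) for _ in range(codes.shape[1])]
    for doc_id, code in enumerate(codes):
        for chunk, codeword in enumerate(code):
            index[chunk][int(codeword)].append(doc_id)
    return index


def search(index, query_code, n_docs, top_k):
    """Score documents by how many chunks share the query's codeword."""
    scores = np.zeros(n_docs, dtype=np.int32)
    for chunk, codeword in enumerate(query_code):
        postings = index[chunk].get(int(codeword), [])
        if postings:
            scores[postings] += 1  # doc ids are unique within a posting list
    order = np.argsort(-scores, kind="stable")
    return order[:top_k], scores


# Hypothetical sizes; the random projection is a stand-in for a trained encoder.
rng = np.random.default_rng(0)
dim, n_chunks, n_codewords, n_docs = 32, 8, 16, 200
projection = rng.normal(size=(dim, n_chunks * n_codewords))
docs = rng.normal(size=(n_docs, dim))

doc_codes = encode(docs, projection, n_chunks, n_codewords)
index = build_index(doc_codes)

# Use document 0 as the query, so it should match itself in every chunk.
query_code = encode(docs[:1], projection, n_chunks, n_codewords)[0]
top_ids, scores = search(index, query_code, n_docs, top_k=5)
```

Posting lists are only balanced if codewords are used roughly evenly, which is plausibly the role the abstract assigns to the uniformity regularizer; with a random projection, as here, nothing guarantees that balance.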
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Sparse Autoencoder
