SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval

Eunseong Choi; Sunkyung Lee; Minjin Choi; Hyeseon Ko; Young-In Song; and Jongwuk Lee

arXiv:2209.05917 · cs.IR · October 6, 2023

TL;DR

SpaDE is a uni-encoder ranking model that uses a dual document encoder to enhance sparse document representations for first-stage retrieval, balancing lexical and semantic matching without expensive query inference.

Contribution

It introduces a dual document encoder, trained with a co-training strategy, that improves sparse representations and addresses vocabulary mismatch without high query inference costs.

Findings

01

SpaDE outperforms existing uni-encoder models on multiple benchmarks.

02

It balances lexical matching (adjusting the importance of existing terms) with semantic matching (expanding additional terms).

03

The co-training strategy enhances training efficiency and effectiveness.

Abstract

Sparse document representations have been widely used to retrieve relevant documents via exact lexical matching. Owing to the pre-computed inverted index, they support fast ad-hoc search but incur the vocabulary mismatch problem. Although recent neural ranking models using pre-trained language models can address this problem, they usually require expensive query inference costs, implying a trade-off between effectiveness and efficiency. To tackle this trade-off, we propose a novel uni-encoder ranking model, Sparse retriever using a Dual document Encoder (SpaDE), which learns document representations via a dual encoder. The two encoders play central roles in, respectively, (i) adjusting the importance of terms to improve lexical matching and (ii) expanding additional terms to support semantic matching. Furthermore, our co-training strategy trains the dual encoder effectively and avoids unnecessary…
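
As a rough illustration of the idea in the abstract, the toy sketch below merges a term-weighting vector with a term-expansion vector into one sparse document vector, stores it in an inverted index, and scores a query by looking up its terms. Everything here is an assumption made for illustration: the function names (`combine`, `build_index`, `search`), the linear interpolation with `alpha`, the `top_k` pruning of expansion terms, and the hand-written weights are hypothetical and are not taken from the paper. The query is treated as a plain set of terms, so no model runs at query time.

```python
from collections import defaultdict


def combine(weighted, expanded, alpha=0.5, top_k=2):
    """Merge term weights with the top_k strongest expansion terms.

    `weighted` and `expanded` map term -> score. The interpolation with
    `alpha` and the top_k cut are illustrative assumptions only.
    """
    # Keep only the strongest expansion terms so the vector stays sparse.
    kept = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    doc = defaultdict(float)
    for term, score in weighted.items():
        doc[term] += (1 - alpha) * score
    for term, score in kept:
        doc[term] += alpha * score
    return dict(doc)


def build_index(docs):
    """Build an inverted index: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vector in docs.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query_terms):
    """Score documents by summing the weights of matching query terms."""
    scores = defaultdict(float)
    for term in set(query_terms):
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hand-written scores standing in for the outputs of the two encoders.
    d1 = combine({"car": 1.0, "engine": 0.5},
                 {"vehicle": 0.8, "automobile": 0.6, "road": 0.1})
    d2 = combine({"bike": 1.0}, {"bicycle": 0.9})
    index = build_index({"d1": d1, "d2": d2})
    # "vehicle" never appears in d1's original terms; it matches via expansion.
    print(search(index, ["vehicle", "engine"]))
```

In this sketch the query term "vehicle" retrieves `d1` only because it was added as an expansion term, which is the vocabulary-mismatch case the abstract describes; the document-side work happens once at indexing time.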
