Investigating Mixture of Experts in Dense Retrieval

Effrosyni Sokli; Pranav Kasela; Georgios Peikos; Gabriella Pasi

arXiv:2412.11864 · cs.IR · December 17, 2024

Open Access

TL;DR

This paper explores the integration of a single Mixture-of-Experts (MoE) block after the final Transformer layer in Dense Retrieval Models to enhance their effectiveness and robustness across various benchmarks.

Contribution

It introduces and empirically evaluates a novel SB-MoE architecture applied after the last Transformer layer in DRMs, comparing its performance to standard fine-tuning.

Findings

1. SB-MoE improves retrieval effectiveness for small models like TinyBERT.
2. Larger models like BERT and Contriever need more training data for SB-MoE to outperform fine-tuning.
3. SB-MoE's performance varies with the number of experts and model size.

Abstract

While Dense Retrieval Models (DRMs) have advanced Information Retrieval (IR), one limitation of these neural models is their narrow generalizability and robustness. To cope with this issue, one can leverage the Mixture-of-Experts (MoE) architecture. While previous IR studies have incorporated MoE architectures within the Transformer layers of DRMs, our work investigates an architecture that integrates a single MoE block (SB-MoE) after the output of the final Transformer layer. Our empirical evaluation investigates how SB-MoE compares, in terms of retrieval effectiveness, to standard fine-tuning. In detail, we fine-tune three DRMs (TinyBERT, BERT, and Contriever) across four benchmark collections with and without adding the MoE block. Moreover, since MoE showcases performance variations with respect to its parameters (i.e., the number of experts), we conduct additional experiments to…
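
As a rough illustration of the architecture the abstract describes, the Python sketch below applies one MoE block to a vector that stands in for the output of the final Transformer layer. The class name SingleMoEBlock, the single-matrix linear experts, the softmax gate, and the weighted-sum aggregation are all assumptions made for illustration; the paper's actual expert design, gating function, and pooling strategy may differ, and the weights here are random rather than trained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dot(a, b):
    """Dot-product relevance score between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


class SingleMoEBlock:
    """Hypothetical single MoE block applied to an encoder's final embedding."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Assumption: each expert is one linear map; real experts may be deeper.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        # The gate maps the embedding to one score per expert.
        self.gate = random_matrix(num_experts, dim, rng)

    def gate_weights(self, embedding):
        """Return a probability distribution over the experts."""
        return softmax(matvec(self.gate, embedding))

    def forward(self, embedding):
        """Combine all expert outputs, weighted by the gate."""
        weights = self.gate_weights(embedding)
        outputs = [matvec(expert, embedding) for expert in self.experts]
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(embedding))
        ]


# Stand-ins for the final-layer embeddings of a query and a document.
query_vec = [0.2, -0.1, 0.4, 0.3]
doc_vec = [0.1, 0.0, 0.5, 0.2]

block = SingleMoEBlock(dim=4, num_experts=3)
score = dot(block.forward(query_vec), block.forward(doc_vec))
```

Changing num_experts in this sketch changes only the gate and the list of experts, which matches the abstract's framing of the number of experts as the parameter under study; the encoder that would produce query_vec and doc_vec is left out entirely.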

Taxonomy

Topics: Expert finding and Q&A systems

Methods: Attention Dropout · Linear Layer · Linear Warmup With Linear Decay · Weight Decay · WordPiece · Adam · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer