Layer-wise Token Compression for Efficient Document Reranking
Shengyao Zhuang, Zhichao Xu, Ivano Lauriola

TL;DR
This paper introduces Layer-wise Token Compression (LTC), a method that applies adaptive token pooling at intermediate transformer layers to improve the inference speed of document rerankers without sacrificing ranking quality.
Contribution
The paper proposes LTC, an approach that compresses tokens at intermediate transformer layers, and shows that it yields significant inference speedups and a regularization effect for document reranking models.
Findings
LTC increases inference throughput (queries per second, QPS) by up to 25% for passage ranking.
LTC increases inference throughput (QPS) by up to 116% for document ranking.
Models trained with LTC outperform uncompressed models on long-document tasks.
Abstract
Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that…
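To make the idea of token compression concrete, the following is a minimal sketch of pooling adjacent token vectors in a sequence of hidden states. It is not the paper's method: the abstract describes adaptive pooling, whose details are not given here, so this example assumes a fixed window and simple mean pooling. The function name `mean_pool_tokens` and the `window` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def mean_pool_tokens(hidden, mask, window=2):
    """Average each run of `window` adjacent token vectors.

    Assumption: fixed-window mean pooling, used here as a stand-in for
    the adaptive pooling described in the abstract.

    hidden: (seq_len, dim) hidden states from some intermediate layer.
    mask:   (seq_len,) array, 1 for real tokens and 0 for padding.
    Returns the pooled hidden states and the pooled mask.
    """
    seq_len, dim = hidden.shape
    pad = (-seq_len) % window  # pad so the length divides evenly by window
    hidden = np.pad(hidden, ((0, pad), (0, 0)))
    mask = np.pad(mask, (0, pad)).astype(hidden.dtype)

    h = hidden.reshape(-1, window, dim)
    m = mask.reshape(-1, window, 1)
    counts = m.sum(axis=1)  # number of real tokens in each group
    # Masked mean: padding positions do not contribute to the average.
    pooled = (h * m).sum(axis=1) / np.maximum(counts, 1.0)
    pooled_mask = (counts[:, 0] > 0).astype(int)
    return pooled, pooled_mask

# Five token vectors of dimension 2; the last position is padding.
hidden = np.arange(10, dtype=float).reshape(5, 2)
mask = np.array([1, 1, 1, 1, 0])
pooled, pooled_mask = mean_pool_tokens(hidden, mask, window=2)
print(pooled.shape)  # sequence length drops from 5 to 3
```

With a window of 2 the layers after the pooling point see roughly half as many tokens, which is where the computational saving would come from. Applying such a step at an intermediate layer rather than at the embedding layer lets the earlier layers attend over the full-resolution sequence first.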
