TL;DR
This paper analyzes internal attention signals across transformer layers for zero-shot re-ranking. It finds a universal bell-curve distribution of relevance signals across layers and proposes Selective-ICR, a method that improves both efficiency and effectiveness.
Contribution
It provides a comprehensive layer-wise analysis of internal attention, identifies a universal relevance distribution across layers, and proposes Selective-ICR, a strategy that improves re-ranking efficiency without sacrificing performance.
Findings
Relevance signals follow a universal bell-curve distribution across transformer layers.
Selective-ICR reduces inference latency by 30%-50%.
A zero-shot 8B model matches or outperforms larger models and state-of-the-art methods.
Abstract
Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an O(1) alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct…
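The idea of scoring documents from attention weights instead of generated text can be illustrated with a minimal sketch. It assumes the attention tensor has already been extracted from a single forward pass, and it uses a plain mean over layers, heads, and query tokens. The function names, the tensor layout, and the choice of which layers to keep are illustrative assumptions, not the paper's actual aggregation or layer-selection rule.

```python
import numpy as np


def attention_scores(attn, query_idx, doc_spans, layers=None):
    """Score documents by the attention that query tokens pay to them.

    attn:      array of shape (num_layers, num_heads, seq_len, seq_len),
               where attn[l, h, i, j] is the attention from token i to token j.
    query_idx: positions of the query tokens in the prompt.
    doc_spans: list of (start, end) token ranges, one per candidate document.
    layers:    layer indices to aggregate over; None means all layers.
               Restricting this list is the (assumed) "selective" variant.
    """
    attn = np.asarray(attn, dtype=float)
    if layers is None:
        layers = range(attn.shape[0])
    selected = attn[list(layers)]  # (num_selected, heads, seq, seq)
    # Average over selected layers, heads, and query tokens -> one weight per token.
    per_token = selected[:, :, list(query_idx), :].mean(axis=(0, 1, 2))
    # A document's score is the total weight falling on its tokens.
    return np.array([per_token[start:end].sum() for start, end in doc_spans])


def rank(scores):
    """Return document indices ordered from highest to lowest score."""
    return [int(i) for i in np.argsort(-np.asarray(scores), kind="stable")]


# Toy attention tensor (hypothetical values): 4 layers, 2 heads, 6 tokens.
attn = np.zeros((4, 2, 6, 6))
doc_spans = [(0, 2), (2, 4)]   # document 0 = tokens 0-1, document 1 = tokens 2-3
query_idx = [4, 5]             # query = tokens 4-5
attn[[0, 3], :, 4:, 0:2] = 0.3  # outer layers put some weight on document 0
attn[[1, 2], :, 4:, 2:4] = 0.4  # middle layers put more weight on document 1

all_layers = attention_scores(attn, query_idx, doc_spans)
middle_only = attention_scores(attn, query_idx, doc_spans, layers=[1, 2])
print(rank(all_layers), rank(middle_only))
```

Aggregating over fewer layers yields the same kind of score vector from fewer attention maps. In a real model, a saving in latency would additionally depend on stopping the forward pass once the last selected layer has been computed, which this sketch does not model.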
