Query-driven Segment Selection for Ranking Long Documents

Youngwoo Kim; Razieh Rahimi; Hamed Bonab; James Allan

arXiv:2109.04611·cs.IR·September 13, 2021

TL;DR

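This paper introduces a query-driven segment selection method for building training data for transformer-based rankers on long documents. A basic BERT-based ranker trained on the selected segments significantly outperforms the same ranker trained on heuristically selected segments, such as the first segment of each document.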

Contribution

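It proposes a query-driven segment selector for long-document ranking with transformers. The selector improves training data quality by supplying relevant samples with more accurate labels and non-relevant samples that are harder to predict.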

Findings

1. Significantly outperforms heuristic segment selection in ranking accuracy.

2. Performs comparably to state-of-the-art models with localized self-attention.

3. Enables more efficient training of transformer rankers on long documents.

Abstract

Transformer-based rankers have shown state-of-the-art performance. However, their self-attention operation is mostly unable to process long sequences. One common approach to training these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of the documents. To address this problem, we propose query-driven segment selection from long documents to build training data. The segment selector provides relevant samples with more accurate labels and non-relevant samples that are harder to predict. The experimental results show that the basic BERT-based ranker trained with the proposed segment selector significantly outperforms the same ranker trained on the heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that…
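
The contrast between first-segment and query-driven selection can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how segments are scored, so the term-overlap scorer below is an assumed stand-in, and the names `split_into_segments`, `overlap_score`, `select_segment`, and the `segment_len` parameter are all hypothetical.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[List[str], List[str]], float]


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Split a tokenized document into consecutive, non-overlapping segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def overlap_score(query: List[str], segment: List[str]) -> float:
    """ASSUMPTION: a toy scorer, the fraction of distinct query terms found
    in the segment. The paper's selector is not specified in the abstract."""
    query_terms = set(query)
    if not query_terms:
        return 0.0
    segment_terms = set(segment)
    return len(query_terms & segment_terms) / len(query_terms)


def select_segment(
    query: List[str],
    doc_tokens: List[str],
    segment_len: int,
    scorer: Scorer = overlap_score,
) -> Tuple[int, List[str]]:
    """Return (index, segment) of the segment the scorer rates highest for
    the query. Ties resolve to the earliest segment."""
    segments = split_into_segments(doc_tokens, segment_len)
    if not segments:
        raise ValueError("document has no tokens")
    best = max(range(len(segments)), key=lambda i: scorer(query, segments[i]))
    return best, segments[best]


if __name__ == "__main__":
    query = ["segment", "selection"]
    doc = ("this page lists contact details and the paper "
           "studies segment selection for ranking").split()

    # First-segment heuristic: always takes segment 0, which here
    # contains no query terms.
    print(split_into_segments(doc, 4)[0])

    # Query-driven selection: picks the segment that matches the query.
    print(select_segment(query, doc, 4))
```

For a relevant document, the selected segment is the one most likely to justify the positive label. For a non-relevant document, the same selection yields the segment that looks most query-related, which is what makes it a harder negative. Any scorer with the same signature can be passed as `scorer`.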
