Two-Stage Document Length Normalization for Information Retrieval
Seung-Hoon Na

TL;DR
This paper introduces a two-stage document length normalization method that separately normalizes verbosity and scope, improving retrieval performance by addressing limitations of standard length normalization.
Contribution
It separates verbosity normalization from scope normalization, applies a different penalization function to each, and formulates a new verbosity normalized retrieval model from the combination.
Findings
The verbosity normalized (VN) model shows statistically significant improvements
Outperforms standard retrieval models on TREC collections
Addresses verbosity and scope effects separately
Abstract
The standard approach for term frequency normalization is based only on the document length. However, it does not distinguish the verbosity from the scope, these being the two main factors determining the document length. Because the verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this paper proposes two-stage normalization by performing verbosity and scope normalization separately, and by employing different penalization functions. In verbosity normalization, each document is pre-normalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the pre-normalized document, finally leading…
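The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the abstract does not define verbosity, so the sketch assumes it is the document length divided by the number of distinct terms, and it uses a BM25-style term-frequency component (without IDF) as a stand-in for "an existing retrieval model". The function names and the parameters `k1` and `b` are hypothetical choices made for the example.

```python
from collections import Counter


def verbosity(tokens):
    """ASSUMPTION: verbosity = document length / number of distinct terms."""
    if not tokens:
        return 1.0  # avoid division by zero for empty documents
    return len(tokens) / len(set(tokens))


def pre_normalize(tokens):
    """Stage 1 (verbosity normalization): divide each TF by the verbosity."""
    v = verbosity(tokens)
    return {term: tf / v for term, tf in Counter(tokens).items()}


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Stage 2 (scope normalization).

    ASSUMPTION: a BM25-style saturating TF component stands in for the
    existing retrieval model; IDF is omitted to keep the sketch short.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def two_stage_scores(docs, query):
    """Score each tokenized document in `docs` against the query terms."""
    normalized = [pre_normalize(d) for d in docs]
    # Lengths are measured on the pre-normalized documents, not the raw ones.
    lengths = [sum(n.values()) for n in normalized]
    avg_len = sum(lengths) / len(lengths)
    return [
        sum(bm25_tf(n.get(t, 0.0), length, avg_len) for t in query)
        for n, length in zip(normalized, lengths)
    ]


docs = [["a", "b"], ["a", "a", "b", "b"]]
scores = two_stage_scores(docs, ["a"])
```

Under the assumed verbosity definition, the second document is simply the first one repeated, so stage 1 maps both to the same pre-normalized term frequencies and they receive the same score. A document that is long because it covers more distinct terms would instead be penalized in stage 2.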
Taxonomy
Topics: Image Retrieval and Classification Techniques · Web Data Mining and Analysis · Information Retrieval and Search Behavior
