Two-Stage Document Length Normalization for Information Retrieval

Seung-Hoon Na

arXiv:1502.04331 · cs.IR · February 17, 2015

TL;DR

This paper introduces a two-stage document length normalization method that separately normalizes verbosity and scope, improving retrieval performance by addressing limitations of standard length normalization.

Contribution

The paper proposes normalizing verbosity and scope separately, each with its own penalization function, which leads to a new verbosity-normalized (VN) retrieval model.

Findings

01. VN model shows statistically significant improvements

02. Outperforms standard retrieval models on TREC collections

03. Addresses verbosity and scope effects separately

Abstract

The standard approach for term frequency normalization is based only on the document length. However, it does not distinguish the verbosity from the scope, these being the two main factors determining the document length. Because the verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this paper proposes two-stage normalization by performing verbosity and scope normalization separately, and by employing different penalization functions. In verbosity normalization, each document is pre-normalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the pre-normalized document, finally leading…
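The two stages described in the abstract can be illustrated with a minimal sketch. The abstract as shown here does not define verbosity, so the code assumes it is the document length divided by the number of distinct terms. It also picks the BM25 term-frequency component as the "existing retrieval model" for the scope stage. The function names, the toy document, and the `avg_len` value are all hypothetical; this is not the author's implementation.

```python
from collections import Counter


def pre_normalize(tokens):
    """Stage 1 (verbosity normalization): divide each term frequency by verbosity.

    ASSUMPTION: verbosity = document length / number of distinct terms.
    The input is assumed to be a non-empty list of tokens.
    """
    counts = Counter(tokens)
    verbosity = len(tokens) / len(counts)
    return {term: tf / verbosity for term, tf in counts.items()}


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Stage 2 stand-in (scope normalization): the standard BM25 tf component.

    ASSUMPTION: BM25 is chosen here only as an example of an existing model.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


doc = ["retrieval", "model", "retrieval"]
verbose_doc = doc * 2  # same content repeated: more verbose, same scope

norm = pre_normalize(doc)
norm_verbose = pre_normalize(verbose_doc)

# The length of the pre-normalized document is the sum of its normalized frequencies.
norm_len = sum(norm.values())

# avg_len is a made-up collection statistic for illustration only.
score = bm25_tf(norm["retrieval"], norm_len, avg_len=2.0)
```

Under the assumed definition, repeating a document's content leaves its pre-normalized term frequencies unchanged, so only scope (here, the count of distinct terms) remains for the second stage to penalize.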

Taxonomy

Topics: Image Retrieval and Classification Techniques · Web Data Mining and Analysis · Information Retrieval and Search Behavior