Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches
Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee

TL;DR
This paper introduces a new term frequency normalization method that accounts for verbosity and multi-topicality in documents, improving language modeling and retrieval precision.
Contribution
It proposes a partially-axiomatic TF normalization approach that differentiates between verbosity and multi-topicality, enhancing language modeling techniques.
Findings
Significant increase in keyword query precision
Substantial improvement in MAP for verbose queries
Better handling of document length variations
Abstract
Term frequency normalization is a serious issue because document lengths vary widely. Documents generally become long for two different reasons: verbosity and multi-topicality. Verbosity means that the same topic is mentioned repeatedly using terms related to that topic, so term frequencies are higher than they would be in a well-summarized document. Multi-topicality means that a document discusses many topics broadly rather than a single topic. Although these two characteristics should be handled differently, all previous term frequency normalization methods have ignored the difference and used a simplified length-driven approach that reduces term frequency based only on document length, causing unreasonable penalization. To attack this problem, we propose a novel TF normalization method, which is a type of partially-axiomatic approach.…
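The distinction the abstract draws can be illustrated with a small sketch. This is not the paper's normalization method, which the excerpt above does not spell out. It assumes, purely for illustration, that verbosity can be approximated by the average frequency per unique term and multi-topicality by the number of unique terms. The function names and the toy documents are hypothetical.

```python
from collections import Counter


def length_profile(tokens):
    """Split document length into two illustrative factors.

    Assumption (not taken from the paper excerpt):
      length = unique_terms * avg_tf
    where avg_tf stands in for verbosity and unique_terms for
    multi-topicality.
    """
    counts = Counter(tokens)
    length = sum(counts.values())
    unique_terms = len(counts)
    avg_tf = length / unique_terms if unique_terms else 0.0
    return {"length": length, "unique_terms": unique_terms, "avg_tf": avg_tf}


def length_normalized_tf(term, tokens):
    """Purely length-driven normalization: raw count divided by length."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


# Toy documents (hypothetical). Both long documents have 12 tokens.
concise = "retrieval model ranking".split()
verbose = concise * 4  # same topic repeated four times
multi_topic = concise + (
    "protein folding enzyme soccer league goal market stock bond".split()
)

for name, doc in [("concise", concise), ("verbose", verbose),
                  ("multi_topic", multi_topic)]:
    print(name, length_profile(doc), length_normalized_tf("retrieval", doc))
```

In this toy setup the verbose and multi-topical documents have the same length, so a length-only method divides both by the same amount. The profile shows they are long for different reasons: the verbose document has a high average term frequency over few unique terms, while the multi-topical one has many unique terms that each occur once.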
