Testing different Log Bases For Vector Model Weighting Technique

Kamel Assaf

arXiv:2307.06213·cs.IR·July 13, 2023


TL;DR

This paper investigates how using different logarithm bases in the IDF component of TFIDF affects information retrieval performance, testing bases from 0.1 to 100 across multiple datasets.

Contribution

It presents a systematic analysis of how the choice of log base affects TFIDF weighting in IR systems, which it describes as a novel exploration.

Findings

01 Different log bases significantly influence retrieval effectiveness.

02 The optimal log base varies depending on the dataset and context.

03 Using non-standard bases can improve retrieval performance in some cases.

Abstract

Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are first indexed, and the words in the documents are assigned weights using a weighting technique called TFIDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then taking the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. The aim of testing different log bases for the vector model weighting technique is to highlight the importance of…
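The weighting described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `idf` and `tfidf_weights`, the pre-tokenized toy documents, and the use of raw term counts for TF are all hypothetical choices made here for clarity.

```python
import math
from collections import Counter


def idf(n_docs, doc_freq, base=10.0):
    """IDF as described in the abstract: log_base(N / df).

    `base` must be positive and different from 1, otherwise the
    logarithm is undefined.
    """
    if base <= 0 or base == 1:
        raise ValueError("log base must be positive and not equal to 1")
    return math.log(n_docs / doc_freq, base)


def tfidf_weights(docs, base=10.0):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists (an assumed input format).
    TF is the raw count of the term in the document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {term: tf[term] * idf(n_docs, doc_freq[term], base) for term in tf}
        )
    return weights


# Toy corpus, invented for illustration only.
docs = [
    ["log", "base", "retrieval"],
    ["log", "weighting"],
    ["retrieval", "retrieval", "query"],
]

# Compare the weight of the same term under several log bases.
for base in (0.1, 2.0, 10.0, 100.0):
    w = tfidf_weights(docs, base=base)
    print(base, w[2]["retrieval"])
```

Because the quotient N / df is at least 1, a base below 1 (such as the 0.1 at the low end of the tested range) yields an IDF that is zero or negative, while base 1 itself has to be excluded since the logarithm is undefined there.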

