Association via Entropy Reduction

Anthony Gamst; Lawrence Wilson

arXiv:2511.04901·cs.IR·November 10, 2025

Open Access

TL;DR

This paper introduces an association score called 'aver', derived from entropy reduction. On a dataset with ground-truth association labels, built from vertices in a large graph, aver finds associated pairs better than tf-idf.

Contribution

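The paper proposes 'aver', an entropy-based association measure, and reports that it beats tf-idf on one labeled graph dataset. It also argues that aver has a natural threshold for unassociated pairs, separates pairs that tf-idf scores identically at 1.0, and applies to sets of more than two documents.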

Findings

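01 Aver outperforms tf-idf at identifying associated pairs on a dataset with ground-truth labels.

02 Aver has a natural threshold for declaring pairs unassociated; tf-idf does not.

03 Aver can be applied to collections of more than two documents, not only pairs; tf-idf cannot.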

Abstract

Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was clearly regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a dataset with ground truth marking for association, that aver does better at finding associated pairs than tf-idf. This example involves finding associated vertices in a large graph, which may be an area where neural networks are not currently an obvious best choice. Beyond this one anecdote, we observe that (1) aver has a natural threshold for declaring pairs as unassociated while tf-idf does not, (2) aver can distinguish between pairs of documents for which tf-idf gives a score of 1.0, (3) aver can be applied to larger collections of documents than pairs while tf-idf cannot, and (4) that aver is derived from entropy under a simple statistical…
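Point (2) of the abstract can be illustrated on the tf-idf side alone. The sketch below is not taken from the paper: it assumes one common tf-idf variant (raw term count times log(N/df), compared by cosine similarity), and the toy documents and the function names `tfidf_vectors` and `cosine` are hypothetical. It shows that a document and a doubled copy of itself receive a tf-idf score of 1.0, so the score alone cannot separate such a pair from an exact duplicate. The aver score is not implemented here because its definition is not given in this section.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized documents.

    Assumed weighting: raw term count * log(N / document frequency).
    This is one common variant, not necessarily the one used in the paper.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {term: c * math.log(n_docs / doc_freq[term]) for term, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: doc 1 is doc 0 repeated twice.
docs = [
    "graph vertex edge".split(),
    "graph vertex edge graph vertex edge".split(),
    "entropy score threshold".split(),
]
vecs = tfidf_vectors(docs)

sim_repeated = cosine(vecs[0], vecs[1])   # proportional vectors
sim_disjoint = cosine(vecs[0], vecs[2])   # no shared terms
print(sim_repeated, sim_disjoint)
```

Because cosine similarity ignores vector length, any two documents whose tf-idf vectors are proportional score 1.0 under this variant, which is the kind of tie the abstract says aver can break.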

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Information Retrieval and Search Behavior