Association via Entropy Reduction
Anthony Gamst, Lawrence Wilson

TL;DR
This paper introduces 'aver', an association score based on entropy reduction. On a dataset with ground-truth association labels, where the task is finding associated vertices in a large graph, aver identifies associated pairs better than tf-idf.
Contribution
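The paper proposes 'aver', an entropy-based association measure, and argues for four advantages over tf-idf: a natural threshold for declaring pairs unassociated, the ability to distinguish pairs that tf-idf scores identically at 1.0, applicability to collections of more than two documents, and a derivation from entropy.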
Findings
Aver finds associated pairs better than tf-idf on a dataset with ground-truth association labels.
Aver has a natural threshold for declaring pairs unassociated; tf-idf does not.
Aver can be applied to collections of more than two documents; tf-idf cannot.
Abstract
Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was clearly regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a dataset with ground truth marking for association, that aver does better at finding associated pairs than tf-idf. This example involves finding associated vertices in a large graph, and that may be an area where neural networks are not currently an obvious best choice. Beyond this one anecdote, we observe that (1) aver has a natural threshold for declaring pairs as unassociated while tf-idf does not, (2) aver can distinguish between pairs of documents for which tf-idf gives a score of 1.0, (3) aver can be applied to larger collections of documents than pairs while tf-idf cannot, and (4) aver is derived from entropy under a simple statistical…
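Point (2) of the abstract can be made concrete with a minimal sketch. It assumes a common tf-idf variant (raw term counts times log(N/df), compared by cosine similarity); the paper's exact tf-idf scoring may differ, and the toy corpus and function names here are hypothetical. It shows only the tf-idf side: two pairs with very different amounts of shared content both score 1.0. It does not implement aver, whose definition is not given in this summary.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per tokenized document.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: counts[t] * math.log(n_docs / doc_freq[t]) for t in counts}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: one pair sharing a single term,
# one pair sharing four terms, and one unrelated document.
docs = [
    ["a"],
    ["a"],
    ["b", "c", "d", "e"],
    ["b", "c", "d", "e"],
    ["f"],
]
vecs = tfidf_vectors(docs)

score_short = cosine(vecs[0], vecs[1])      # pair sharing one term
score_long = cosine(vecs[2], vecs[3])       # pair sharing four terms
score_unrelated = cosine(vecs[0], vecs[4])  # pair sharing nothing

print(score_short, score_long, score_unrelated)
```

Under these assumptions both identical pairs reach the maximum cosine score even though one rests on a single shared term and the other on four, so the tf-idf score alone cannot rank one pair above the other.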
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Information Retrieval and Search Behavior
