On the Normalization and Visualization of Author Co-Citation Data: Salton's Cosine versus the Jaccard Index
Loet Leydesdorff

TL;DR
This paper compares Salton's Cosine and the Jaccard Index for normalizing author co-citation data, highlighting the advantages of the Jaccard Index in web environments and its focus on set intersection.
Contribution
It proposes normalizing author co-citation data with the Jaccard index, preferably after adding the total number of citations to the main diagonal of the co-citation matrix, for cases where the original citation data cannot be retrieved.
Findings
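The Jaccard index abstracts from the shape of the distributions and uses only the intersection and the sum of the two sets, which is an advantage when correlations in the co-occurrence matrix may be partially spurious.
Adding the total number of citations (occurrences) to the main diagonal is the preferred way to apply the Jaccard index.
The Jaccard index is the recommended measure in the Web environment, where retrieving the original citation data is often not feasible.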
Abstract
The debate about which similarity measure one should use for the normalization in the case of Author Co-citation Analysis (ACA) is further complicated when one distinguishes between the symmetrical co-citation--or, more generally, co-occurrence--matrix and the underlying asymmetrical citation--occurrence--matrix. In the Web environment, the approach of retrieving original citation data is often not feasible. In that case, one should use the Jaccard index, but preferentially after adding the number of total citations (occurrences) on the main diagonal. Unlike Salton's cosine and the Pearson correlation, the Jaccard index abstracts from the shape of the distributions and focuses only on the intersection and the sum of the two sets. Since the correlations in the co-occurrence matrix may partially be spurious, this property of the Jaccard index can be considered as an advantage in this case.
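The difference between the two normalizations can be illustrated with a minimal sketch. It assumes a binary occurrence matrix (rows are citing documents, columns are cited authors) and takes the Jaccard index of authors i and j to be c_ij / (c_ii + c_jj - c_ij), where c_ij is the co-citation count and c_ii is the total number of citations placed on the main diagonal. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt


def cooccurrence(occ):
    """Symmetric co-occurrence matrix from a documents-by-authors matrix.

    For binary input, the main diagonal holds each author's total citations.
    """
    n = len(occ[0])
    return [[sum(row[i] * row[j] for row in occ) for j in range(n)]
            for i in range(n)]


def jaccard(cooc):
    """Jaccard index using only the co-occurrence matrix and its diagonal."""
    n = len(cooc)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Sum of the two sets minus their intersection.
            union = cooc[i][i] + cooc[j][j] - cooc[i][j]
            sim[i][j] = cooc[i][j] / union if union else 0.0
    return sim


def cosine(occ):
    """Salton's cosine between author columns of the occurrence matrix."""
    n = len(occ[0])
    cols = [[row[i] for row in occ] for i in range(n)]
    norms = [sqrt(sum(v * v for v in col)) for col in cols]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            denom = norms[i] * norms[j]
            sim[i][j] = dot / denom if denom else 0.0
    return sim


# Hypothetical toy data: 4 citing documents, 3 cited authors (A, B, C).
occ = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
cooc = cooccurrence(occ)
jac = jaccard(cooc)
cos = cosine(occ)
print("co-occurrence:", cooc)
print("Jaccard:", jac)
print("cosine:", cos)
```

In this sketch the Jaccard computation needs only the co-occurrence matrix with totals on the diagonal, whereas the cosine is computed from the underlying occurrence matrix, which matches the situation described above in which the original citation data may not be retrievable.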
Taxonomy
Topics: Data Mining Algorithms and Applications · Semantic Web and Ontologies · Data Visualization and Analytics
