On the distribution of cosine similarity with application to biology
Ian Smith, Janosch Ortmann, Farnoosh Abbas-Aghababazadeh, Petr Smirnov, Benjamin Haibe-Kains

TL;DR
This paper derives the asymptotic moments of cosine similarity as a function of data covariance, providing insights to optimize similarity measures in biological data analysis and beyond.
Contribution
It introduces a theoretical framework linking data covariance to cosine similarity distribution, enabling optimization of similarity measures for biological and other data types.
Findings
For centered data, the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal.
The asymptotic moments of cosine similarity are derived as a function of the data covariance.
The results can be applied to optimize similarity measures in noisy biological datasets.
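The first finding can be checked empirically with a small simulation. The sketch below is not the paper's derivation or code. It assumes independent, zero-mean Gaussian vectors with a diagonal covariance matrix, so the diagonal entries are the eigenvalues. The helper name `cosine_variance` and the two example spectra are hypothetical choices made for illustration. Both spectra are scaled to the same trace, so only the spread of the eigenvalues differs.

```python
import numpy as np

def cosine_variance(eigenvalues, n_pairs=20000, seed=0):
    """Empirical variance of cosine similarity between independent
    zero-mean Gaussian vectors with diagonal covariance diag(eigenvalues).
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    scale = np.sqrt(np.asarray(eigenvalues, dtype=float))
    # Scaling standard normal draws by sqrt(eigenvalue) gives the target covariance.
    x = rng.standard_normal((n_pairs, scale.size)) * scale
    y = rng.standard_normal((n_pairs, scale.size)) * scale
    cos = np.sum(x * y, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    )
    return cos.var()

d = 50

# Equal eigenvalues (isotropic covariance), trace = d.
equal = np.ones(d)

# Unequal eigenvalues spanning two orders of magnitude, rescaled to trace = d.
skewed = np.geomspace(0.01, 1.0, d)
skewed *= d / skewed.sum()

var_equal = cosine_variance(equal)
var_skewed = cosine_variance(skewed)

print(f"equal eigenvalues:   variance = {var_equal:.4f}")
print(f"unequal eigenvalues: variance = {var_skewed:.4f}")
```

Under these assumptions the equal-eigenvalue case should give a variance close to 1/d, the known value for two independent random directions in d dimensions, and the unequal spectrum should give a larger variance, in line with the stated finding.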
Abstract
Cosine similarity is an established similarity metric for computing associations on vectors, and it is commonly used to identify related samples from biological perturbational data. The distribution of cosine similarity changes with the covariance of the data, and this in turn affects the statistical power to identify related signals. The relationship between the mean and covariance of the distribution of the data and the distribution of cosine similarity is poorly understood. In this work, we derive the asymptotic moments of cosine similarity as a function of the data and identify the criteria of the data covariance matrix that minimize the variance of cosine similarity. We find that the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal for centered data. One immediate application of this work is characterizing the null distribution of…
Taxonomy
Topics: Bioinformatics and Genomic Networks · Genetic Associations and Epidemiology
