Boosting t-SNE Efficiency for Sequencing Data: Insights from Kernel Selection
Avais Jan, Prakash Chourasia, Sarwan Ali, Murray Patterson

TL;DR
This paper evaluates nine kernel functions for t-SNE in biological sequence analysis, finding that cosine similarity offers superior efficiency and data preservation over traditional kernels across multiple datasets and embedding methods.
Contribution
It provides a comprehensive comparison of kernel functions for t-SNE on sequencing data, highlighting the effectiveness of cosine similarity for visualization and downstream tasks.
Findings
The cosine similarity kernel outperforms the Gaussian and isolation kernels in visualization quality.
The cosine kernel also achieves better runtime efficiency and pairwise distance preservation than these kernels.
Kernel choice significantly affects downstream classification and clustering performance.
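The preservation finding can be made concrete with a small sketch of a k-nearest-neighbor agreement score: the fraction of each point's k nearest neighbors in the original space that remain among its k nearest neighbors in the 2-D embedding. This is an illustrative assumption, not necessarily the metric used in the paper; the function names `knn_indices`, `neighborhood_agreement`, and `pairwise_euclidean` are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Dense Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def knn_indices(D, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def neighborhood_agreement(D_high, D_low, k):
    """Mean fraction of shared k-nearest neighbors (1.0 = perfectly preserved)."""
    hi = knn_indices(D_high, k)
    lo = knn_indices(D_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(hi, lo)]
    return float(np.mean(shared)) / k

# Toy data standing in for sequence embeddings (assumed, not from the paper).
rng = np.random.default_rng(0)
X_high = rng.normal(size=(20, 5))
X_low = X_high[:, :2]  # crude 2-D "embedding": keep the first two coordinates

D_high = pairwise_euclidean(X_high)
D_low = pairwise_euclidean(X_low)
score = neighborhood_agreement(D_high, D_low, k=3)
```

A score near 1 means local neighborhoods survive the projection; comparing this score across kernels is one way to rank them objectively.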
Abstract
Dimensionality reduction techniques are essential for visualizing and analyzing high-dimensional biological sequencing data. t-distributed Stochastic Neighbor Embedding (t-SNE) is widely used for this purpose, traditionally employing the Gaussian kernel to compute pairwise similarities. However, the Gaussian kernel's lack of data-dependence and computational overhead limit its scalability and effectiveness for categorical biological sequences. Recent work proposed the isolation kernel as an alternative, yet it may not optimally capture sequence similarities. In this study, we comprehensively evaluate nine different kernel functions for t-SNE applied to molecular sequences, using three embedding methods: One-Hot Encoding, Spike2Vec, and minimizers. Through both subjective visualization and objective metrics (including neighborhood preservation scores), we demonstrate that the cosine…
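As a minimal sketch of the idea in the abstract, the code below one-hot encodes a few toy nucleotide sequences, computes a cosine similarity kernel, and converts it into symmetric t-SNE-style affinities by row normalization. The alphabet, the example sequences, and the helper names (`one_hot`, `cosine_kernel`, `affinities`) are assumptions for illustration; the paper's exact pipeline may differ.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide alphabet, for illustration only

def one_hot(seq, alphabet=ALPHABET):
    """Flatten a sequence into a 0/1 vector of length len(seq) * len(alphabet)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vec = np.zeros((len(seq), len(alphabet)))
    for pos, ch in enumerate(seq):
        vec[pos, index[ch]] = 1.0
    return vec.ravel()

def cosine_kernel(X):
    """Cosine similarity between all pairs of rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return Xn @ Xn.T

def affinities(K):
    """Row-normalize off-diagonal similarities, then symmetrize as in t-SNE."""
    K = K.copy()
    np.fill_diagonal(K, 0.0)  # no self-affinity
    P_cond = K / np.clip(K.sum(axis=1, keepdims=True), 1e-12, None)
    n = K.shape[0]
    return (P_cond + P_cond.T) / (2 * n)

# Toy equal-length sequences (hypothetical data).
seqs = ["ACGTAC", "ACGTAA", "TTGCAT", "TTGCAA"]
X = np.stack([one_hot(s) for s in seqs])
K = cosine_kernel(X)  # for one-hot vectors: fraction of matching positions
P = affinities(K)     # symmetric, sums to 1
```

For one-hot vectors of equal length, the cosine similarity reduces to the fraction of positions at which two sequences agree, so it is non-negative and needs no bandwidth parameter. To try this with an off-the-shelf implementation, one option is to pass the cosine distance `1 - K` to scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"`; note that this route still applies the library's own Gaussian affinities to those distances rather than replacing the kernel outright.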
Taxonomy
Topics: Bioinformatics and Genomic Networks · Genetic Associations and Epidemiology · Gene expression and cancer classification
