Informative Initialization and Kernel Selection Improves t-SNE for Biological Sequences
Prakash Chourasia, Sarwan Ali, Murray Patterson

TL;DR
This paper demonstrates that using informed initialization and alternative kernel choices significantly enhances t-SNE's performance and convergence speed when visualizing biological sequence data.
Contribution
The study introduces the use of informed initialization and kernel selection to improve t-SNE's effectiveness for biological sequences.
Findings
Improved t-SNE visualizations with better cluster separation.
Faster convergence of t-SNE with informed initialization.
Enhanced accuracy in biological sequence data representation.
Abstract
The t-distributed stochastic neighbor embedding (t-SNE) is a method for interpreting high-dimensional (HD) data by mapping each point to a low-dimensional (LD) space (usually two-dimensional). It seeks to retain the structure of the data. An important component of the t-SNE algorithm is the initialization procedure, which begins with the random initialization of an LD vector. Points in this initial vector are then updated iteratively by gradient descent to minimize the loss function (the KL divergence). This causes similar points to attract one another while pushing dissimilar points apart. We believe that, by default, these algorithms should employ some form of informative initialization. Another essential component of t-SNE is the kernel matrix, a similarity matrix built from the pairwise distances among the sequences. For t-SNE-based visualization, the Gaussian kernel…
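The loop the abstract describes (a kernel matrix over the HD points, an initial LD layout, and gradient descent on the KL divergence) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the single fixed Gaussian bandwidth `sigma` (standard t-SNE tunes a per-point bandwidth from a perplexity target), the helper names, and the "informed" layout (a scaled copy of the first two input coordinates, used only as a stand-in for whatever informative initialization the paper evaluates) are all hypothetical.

```python
import math
import random

def sq_dists(points):
    """Pairwise squared Euclidean distances."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]

def gaussian_affinities(X, sigma=1.0):
    """Joint affinities P from a Gaussian kernel (assumption: one fixed bandwidth)."""
    D = sq_dists(X)
    n = len(X)
    W = [[0.0 if i == j else math.exp(-D[i][j] / (2 * sigma ** 2)) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W]

def student_t_affinities(Y):
    """Joint affinities Q in the LD space, plus the unnormalized weights W."""
    D = sq_dists(Y)
    n = len(Y)
    W = [[0.0 if i == j else 1.0 / (1.0 + D[i][j]) for j in range(n)] for i in range(n)]
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W], W

def kl_divergence(P, Q):
    """KL(P || Q), the t-SNE loss; zero entries of P contribute nothing."""
    return sum(p * math.log(p / q)
               for rp, rq in zip(P, Q) for p, q in zip(rp, rq) if p > 0)

def gradient_step(P, Y, lr):
    """One plain gradient-descent update of the LD points (no momentum)."""
    Q, W = student_t_affinities(Y)
    n, d = len(Y), len(Y[0])
    new_Y = []
    for i in range(n):
        grad = [0.0] * d
        for j in range(n):
            if i != j:
                # Gradient term: 4 (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for k in range(d):
                    grad[k] += coeff * (Y[i][k] - Y[j][k])
        new_Y.append([Y[i][k] - lr * grad[k] for k in range(d)])
    return new_Y

def optimize(P, Y, steps=200, lr=0.05):
    """Run a fixed number of gradient steps from the given initial layout."""
    for _ in range(steps):
        Y = gradient_step(P, Y, lr)
    return Y

# Toy HD data (hypothetical): two small clusters in 3-D.
X = [[0.0, 0.0, 0.0], [0.5, 0.2, 0.1], [0.1, 0.6, 0.3],
     [3.0, 3.0, 3.0], [3.4, 2.9, 3.2], [2.8, 3.5, 3.1]]
P = gaussian_affinities(X, sigma=1.0)

rng = random.Random(0)
Y_random = [[rng.gauss(0, 1e-2), rng.gauss(0, 1e-2)] for _ in X]  # random start
Y_informed = [[0.1 * x[0], 0.1 * x[1]] for x in X]  # stand-in informed start

for name, Y0 in [("random", Y_random), ("informed", Y_informed)]:
    kl_start = kl_divergence(P, student_t_affinities(Y0)[0])
    kl_end = kl_divergence(P, student_t_affinities(optimize(P, Y0))[0])
    print(f"{name:9s} init: KL {kl_start:.4f} -> {kl_end:.4f}")
```

Swapping `gaussian_affinities` for another kernel function changes only how `P` is built, which is the component the abstract singles out alongside initialization. The small step size and the absence of momentum or early exaggeration keep the sketch short; practical t-SNE implementations use all three.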
Taxonomy
Topics: Gene expression and cancer classification · Face and Expression Recognition · Bioinformatics and Genomic Networks
