GaKCo: a Fast GApped k-mer string Kernel using COunting
Ritambhara Singh, Arshdeep Sekhon, Kamran Kowsari, Jack Lanchantin, Beilun Wang, Yanjun Qi

TL;DR
GaKCo introduces a fast, scalable algorithm for gapped k-mer string kernels that significantly improves speed over existing methods while maintaining accuracy, enabling efficient sequence classification across various domains.
Contribution
The paper presents GaKCo, a novel counting-based algorithm for gapped k-mer string kernels that reduces computational complexity and enhances scalability compared to trie-based approaches.
Findings
GaKCo achieves comparable accuracy to state-of-the-art methods.
GaKCo is 2 to 100 times faster than the trie-based state of the art, depending on the dataset.
The algorithm scales well with larger alphabet sizes and mismatch parameters.
Abstract
String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings, resulting in a time cost proportional to Σ^M. We propose a fast algorithm for calculating Gapped k-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger Σ and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK.…
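The counting idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (gmers, cumulative_matches, gapped_kmer_kernel, brute_force_kernel) are hypothetical, and the sketch assumes the common formulation in which substrings of length g = k + M are compared and two such substrings at Hamming distance d share C(g - d, k) gapped k-mers. For each number of allowed mismatches m, it counts agreements on every set of g - m positions with an associative array, then subtracts the contribution of closer pairs to recover exact mismatch counts. The alphabet size never enters the loops, which is the property the abstract contrasts with the trie-based approach.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gmers(seq, g):
    """Return all contiguous length-g substrings of seq."""
    return [seq[i:i + g] for i in range(len(seq) - g + 1)]


def cumulative_matches(x, y, g, m):
    """Count (g-mer of x, g-mer of y, position set) triples that agree on
    a set of g - m positions, using one associative array per position set."""
    total = 0
    for keep in combinations(range(g), g - m):
        cx = Counter(tuple(a[p] for p in keep) for a in gmers(x, g))
        cy = Counter(tuple(b[p] for p in keep) for b in gmers(y, g))
        total += sum(n * cy[key] for key, n in cx.items())
    return total


def gapped_kmer_kernel(x, y, k, max_mismatch):
    """Gapped k-mer kernel value computed from cumulative counts (sketch)."""
    g = k + max_mismatch
    exact = []  # exact[d] = number of g-mer pairs at Hamming distance exactly d
    for m in range(max_mismatch + 1):
        c_m = cumulative_matches(x, y, g, m)
        # A pair at distance d < m is counted comb(g - d, m - d) times in c_m.
        over = sum(comb(g - d, m - d) * exact[d] for d in range(m))
        exact.append(c_m - over)
    # A pair at distance d shares comb(g - d, k) gapped k-mers.
    return sum(comb(g - d, k) * n for d, n in enumerate(exact))


def brute_force_kernel(x, y, k, max_mismatch):
    """Reference value: compare every pair of g-mers directly."""
    g = k + max_mismatch
    total = 0
    for a in gmers(x, g):
        for b in gmers(y, g):
            d = sum(ca != cb for ca, cb in zip(a, b))
            if d <= max_mismatch:
                total += comb(g - d, k)
    return total
```

The brute-force function is included only as a cross-check on short inputs. The counting version works per position set and per sequence, so each position set could in principle be processed independently, which is consistent with the abstract's remark that the algorithm is naturally parallelizable.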
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Genomics and Chromatin Dynamics
