GaKCo: a Fast GApped k-mer string Kernel using COunting

Ritambhara Singh; Arshdeep Sekhon; Kamran Kowsari; Jack Lanchantin; Beilun Wang; Yanjun Qi

arXiv:1704.07468 · cs.LG · September 19, 2017 · 1 citation

TL;DR

GaKCo introduces a fast, scalable algorithm for gapped k-mer string kernels that significantly improves speed over existing methods while maintaining accuracy, enabling efficient sequence classification across various domains.

Contribution

The paper presents GaKCo, a novel counting-based algorithm for gapped k-mer string kernels that reduces computational complexity and enhances scalability compared to trie-based approaches.

Findings

01 GaKCo achieves accuracy comparable to state-of-the-art methods.

02 GaKCo is 2 to 100 times faster than existing methods, depending on the dataset.

03 The algorithm scales well to larger alphabet sizes ($\Sigma$) and more allowed mismatches ($M$).

Abstract

String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slowly when we increase the dictionary size ($\Sigma$) or allow more mismatches ($M$). This is because current gk-SK uses a trie-based algorithm to calculate the co-occurrence of mismatched substrings, resulting in a time cost proportional to $O(\Sigma^{M})$. We propose a fast algorithm for calculating the Gapped $k$-mer Kernel using Counting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger $\Sigma$ and $M$, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK.…
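The cumulative-counting idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes that sequences are compared through their length-$g$ substrings ($g = k + M$), and that a pair of $g$-mers with exactly $m \le M$ mismatches contributes $\binom{g-m}{k}$ to the kernel, which is the number of gapped $k$-mers the pair shares. All function names are hypothetical. For each $m$, a `Counter` (an associative array) counts the $g$-mer pairs that agree on a chosen set of $g - m$ positions. That cumulative count also includes pairs with fewer than $m$ mismatches, so those are subtracted to recover the exact count.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gmers(s, g):
    """All length-g substrings of s."""
    return [s[i:i + g] for i in range(len(s) - g + 1)]


def cumulative_count(x, y, g, m):
    """Pairs of g-mers (one from x, one from y) that agree on a chosen set
    of g - m positions, summed over all such position sets."""
    total = 0
    for keep in combinations(range(g), g - m):
        # Associative arrays keyed by the characters at the kept positions.
        cx = Counter(tuple(u[i] for i in keep) for u in gmers(x, g))
        cy = Counter(tuple(v[i] for i in keep) for v in gmers(y, g))
        total += sum(n * cy[key] for key, n in cx.items())
    return total


def gapped_kmer_kernel(x, y, g, k):
    """Kernel value via counting (assumed weighting: comb(g - m, k))."""
    max_mismatch = g - k
    exact = []  # exact[m] = number of g-mer pairs with exactly m mismatches
    for m in range(max_mismatch + 1):
        c = cumulative_count(x, y, g, m)
        # A pair with d < m mismatches was counted comb(g - d, m - d) times.
        c -= sum(comb(g - d, m - d) * exact[d] for d in range(m))
        exact.append(c)
    return sum(comb(g - m, k) * exact[m] for m in range(max_mismatch + 1))


def brute_force_kernel(x, y, g, k):
    """Reference: compare every pair of g-mers directly."""
    total = 0
    for u in gmers(x, g):
        for v in gmers(y, g):
            d = sum(a != b for a, b in zip(u, v))
            if d <= g - k:
                total += comb(g - d, k)
    return total
```

In this sketch the counting never enumerates alphabet symbols, so the work depends on the number of position subsets and the number of $g$-mers rather than on $\Sigma^{M}$. The brute-force function is there only to check the counting version on small inputs.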

Code & Models

Repositories

QData/GaKCo-SVM (official)

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Genomics and Chromatin Dynamics
