Maximum margin classifier working in a set of strings
Hitoshi Koyano, Morihiro Hayashida, Tatsuya Akutsu

TL;DR
This paper introduces a novel maximum margin classifier designed to operate directly on string data, avoiding information loss from vectorization, and provides theoretical analysis of its generalization error.
Contribution
It develops a classifier that works directly on strings and extends probability theory for strings to evaluate its asymptotic optimality.
Findings
The classifier operates directly on string data without vectorization.
The classifier is proven to be asymptotically optimal.
Application to protein interaction prediction demonstrates practical usefulness.
Abstract
Numbers and numerical vectors account for a large portion of data. However, recently the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier…
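The information loss mentioned in the abstract can be seen in a small sketch. The example below assumes a k-spectrum feature map (counting every length-k substring), which is one common string kernel and is used here for illustration only. It is not the authors' classifier, and the function names `spectrum_features` and `spectrum_kernel` are hypothetical. It shows two distinct strings mapping to the same numerical vector, so anything trained on those vectors cannot tell them apart.

```python
from collections import Counter


def spectrum_features(s: str, k: int = 2) -> Counter:
    """Map a string to the counts of its length-k substrings (k-spectrum)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 2) -> int:
    """Inner product of the two k-spectrum count vectors."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    return sum(count * ft[kmer] for kmer, count in fs.items())


# Two different strings with identical 2-mer counts: {aa: 1, ab: 1, ba: 1}.
x, y = "aaba", "abaa"

print(x == y)                                        # False
print(spectrum_features(x) == spectrum_features(y))  # True
print(spectrum_kernel(x, y) == spectrum_kernel(x, x))  # True
```

Because `x` and `y` share one feature vector, the mapping is not one-to-one, which is the kind of loss that motivates working in the set of strings directly.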
Taxonomy
Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Algorithms and Data Compression
