Can string kernels pass the test of time in Native Language Identification?
Radu Tudor Ionescu, Marius Popescu

TL;DR
This paper demonstrates that simple string kernel methods, combined with multiple kernel learning, can achieve state-of-the-art results in Native Language Identification (NLI), holding up against more recent NLP approaches in the 2017 shared task.
Contribution
The study shows that a shallow, string kernel-based approach with minor improvements remains competitive and effective for NLI, even compared to modern NLP techniques.
Findings
Achieved top macro F1 scores in all three NLI tracks.
Outperformed other methods in speech and fusion tracks.
Validated the effectiveness of string kernels in NLI tasks.
Abstract
We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former classifier obtains better results than the latter one on the development set. In our previous work, we have used a similar machine learning approach to achieve state-of-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass…
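The pipeline outlined in the abstract (character p-gram kernels, a combination of kernels, and a kernel classifier) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: it uses a plain cosine-normalized p-gram count (spectrum) kernel, combines kernels by an unweighted sum as the simplest form of kernel combination, and trains Kernel Ridge Regression, which has a closed-form solution, rather than the Kernel Discriminant Analysis the authors preferred. The function names, the values of `p`, the regularization constant, and the toy texts and labels are all hypothetical.

```python
from collections import Counter

import numpy as np


def char_pgrams(text, p):
    """Count every contiguous character p-gram in a text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(texts_a, texts_b, p):
    """Cosine-normalized dot product of p-gram count vectors."""
    feats_a = [char_pgrams(t, p) for t in texts_a]
    feats_b = [char_pgrams(t, p) for t in texts_b]
    K = np.zeros((len(feats_a), len(feats_b)))
    for i, fa in enumerate(feats_a):
        for j, fb in enumerate(feats_b):
            K[i, j] = sum(c * fb[g] for g, c in fa.items() if g in fb)
    norm_a = np.array([sum(c * c for c in f.values()) for f in feats_a], dtype=float)
    norm_b = np.array([sum(c * c for c in f.values()) for f in feats_b], dtype=float)
    # Texts shorter than p have no p-grams; avoid dividing by zero for them.
    denom = np.sqrt(np.outer(norm_a, norm_b))
    return K / np.maximum(denom, 1e-12)


def combined_kernel(texts_a, texts_b, p_values=(2, 3)):
    """Unweighted sum of normalized kernels (an assumed, simple combination)."""
    return sum(spectrum_kernel(texts_a, texts_b, p) for p in p_values)


def krr_fit(K_train, labels, n_classes, lam=1e-3):
    """One-vs-all Kernel Ridge Regression: solve (K + lam * I) alpha = Y."""
    n = len(labels)
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), labels] = 1.0
    return np.linalg.solve(K_train + lam * np.eye(n), Y)


def krr_predict(K_test_train, alpha):
    """Pick the class with the highest regression score for each test row."""
    return np.argmax(K_test_train @ alpha, axis=1)


# Invented toy data: two texts per hypothetical native-language class.
train_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "i am agree with this opinion",
    "i am agree with that idea",
]
train_labels = np.array([0, 0, 1, 1])

K_train = combined_kernel(train_texts, train_texts)
alpha = krr_fit(K_train, train_labels, n_classes=2)
train_preds = krr_predict(K_train, alpha)

# Classifying new text only needs the kernel between test and training texts.
K_test = combined_kernel(["i am agree with you"], train_texts)
test_pred = krr_predict(K_test, alpha)
```

Because the classifier only ever sees kernel matrices, a kernel computed from another source, such as the i-vector representation mentioned in the abstract, could be added to the sum without changing the learning code.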
