Novel Keyword Extraction and Language Detection Approaches
Malgorzata Pikies, Andronicus Riyono, Junade Ali

TL;DR
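This paper introduces a fast string tokenization method for fuzzy language matching that cuts processing time by 83.6% and raises estimated recall by 3.1%, at the cost of a 2.6% drop in precision. It also examines how request metadata, specifically the Accept-Language header and the IP address, agrees with language classification results.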
Contribution
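It presents a novel tokenization approach for fuzzy language matching that handles keywords split across multiple words without character-by-character scanning, and provides observational data showing that the Accept-Language header agrees with the language classification more often than the IP address does.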
Findings
83.6% decrease in processing time
3.1% estimated improvement in recall
2.6% decrease in precision
Accept-Language header is 14% more likely than the IP address to match the language classification
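The last finding depends on reading a language preference out of the Accept-Language request header. The sketch below shows one way such a comparison could be set up; it is not the authors' code. The function names `parse_accept_language` and `header_matches` are hypothetical, and comparing only the top-ranked primary language subtag against the classifier output is an assumption made for illustration.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language value into (language-range, q) pairs,
    ordered from most to least preferred."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = [p.strip() for p in part.split(";")]
        tag = pieces[0].lower()
        q = 1.0  # a missing q parameter means full weight
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag, q))
    # sorted() is stable, so equal weights keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def header_matches(header: str, classified_lang: str) -> bool:
    """Assumption: a 'match' means the top-ranked primary subtag
    equals the classifier's language code (e.g. 'pl')."""
    prefs = [(t, q) for t, q in parse_accept_language(header)
             if q > 0 and t != "*"]
    if not prefs:
        return False
    top_primary = prefs[0][0].split("-")[0]
    return top_primary == classified_lang.lower()
```

For example, `header_matches("pl-PL,pl;q=0.9,en;q=0.5", "pl")` is true, while `header_matches("en-US", "pl")` is false. An analogous check against a language inferred from IP geolocation would give the second half of the comparison; that lookup is omitted here because the source does not describe it.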
Abstract
Fuzzy string matching and language classification are important tools in Natural Language Processing pipelines; this paper provides advances in both areas. We propose a fast, novel approach to string tokenisation for fuzzy language matching and experimentally demonstrate an 83.6% decrease in processing time with an estimated improvement in recall of 3.1% at the cost of a 2.6% decrease in precision. This approach is able to work even where keywords are subdivided into multiple words, without needing to scan character-to-character. So far there has been little work considering using metadata to enhance language classification algorithms. We provide observational data and find the Accept-Language header is 14% more likely to match the classification than the IP address.
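The abstract does not spell out the tokenisation algorithm, so the following is only a rough illustration of the general idea of matching a keyword against word-level tokens rather than sliding over every character offset. It is not the authors' method: the helper names `word_windows` and `fuzzy_contains`, the 0.85 threshold, the three-word window limit, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
import re
from difflib import SequenceMatcher


def word_windows(text: str, max_words: int):
    """Yield runs of 1..max_words consecutive words, joined without
    spaces, so a keyword split across words can still line up."""
    words = re.findall(r"\w+", text.lower())
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield "".join(words[start:start + size])


def fuzzy_contains(text: str, keyword: str,
                   threshold: float = 0.85, max_words: int = 3) -> bool:
    """Return True if any word window is similar enough to the keyword.
    Candidates start only at word boundaries, not at every character."""
    target = re.sub(r"\W+", "", keyword.lower())
    return any(
        SequenceMatcher(None, window, target).ratio() >= threshold
        for window in word_windows(text, max_words)
    )
```

With these assumptions, `fuzzy_contains("my fire wall is blocking traffic", "firewall")` is true because the window "fire" + "wall" reproduces the keyword, and `fuzzy_contains("please check the firewal rules", "firewall")` is true because the misspelling is still close to the keyword. Loosening the threshold would be expected to trade precision for recall, which is the kind of trade-off the abstract reports.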
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Topic Modeling
