Novel Keyword Extraction and Language Detection Approaches

Malgorzata Pikies; Andronicus Riyono; Junade Ali

arXiv:2009.11832 · cs.CL · September 25, 2020 · 1 citation

Open Access

TL;DR

This paper introduces a fast string tokenisation method for fuzzy language matching that cuts processing time and raises recall at a small cost in precision, and it examines how request metadata relates to language classification.

Contribution

It presents a novel tokenisation approach for fuzzy language matching and provides observational data showing that the Accept-Language header agrees with the language classification more often than the IP address does.

Findings

01. 83.6% decrease in processing time

02. Estimated 3.1% improvement in recall, at the cost of a 2.6% decrease in precision

03. The Accept-Language header is 14% more likely than the IP address to match the language classification
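To make the Accept-Language finding concrete, the following is a minimal sketch of how a header value could be parsed and compared with a classifier's output. It is an illustration only, not the authors' code: the function names `parse_accept_language` and `header_matches` are hypothetical, and the sketch assumes the classifier returns a lowercase two-letter language code such as `en`.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, q) pairs, highest q first.

    Hypothetical helper for illustration; it handles the common
    "tag;q=value" form and treats a missing q as 1.0.
    """
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # malformed weight: treat as least preferred
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so ties keep the order given in the header
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def header_matches(header, classified_lang):
    """Return True if the top header preference agrees with the classifier.

    Assumes classified_lang is a primary language code such as "en".
    """
    prefs = parse_accept_language(header)
    if not prefs:
        return False
    primary = prefs[0][0].split("-")[0]  # "en-us" -> "en"
    return primary == classified_lang.lower()
```

For example, `header_matches("en-US,en;q=0.9", "en")` compares the primary subtag of the most preferred entry with the classified language. Counting how often this agrees, and doing the same for an IP-derived language guess, is one way such an observational comparison could be set up.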

Abstract

Fuzzy string matching and language classification are important tools in Natural Language Processing pipelines; this paper provides advances in both areas. We propose a fast novel approach to string tokenisation for fuzzy language matching and experimentally demonstrate an 83.6% decrease in processing time with an estimated improvement in recall of 3.1% at the cost of a 2.6% decrease in precision. This approach is able to work even where keywords are subdivided into multiple words, without needing to scan character-to-character. So far, little work has considered using metadata to enhance language classification algorithms. We provide observational data and find the Accept-Language header is 14% more likely to match the classification than the IP address.
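The abstract does not spell out the tokenisation algorithm, so the sketch below is not the paper's method. It only illustrates the general idea of fuzzy keyword matching at token granularity, including the case where a keyword is split across several words. The names `tokenize` and `fuzzy_find`, the 0.85 threshold, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. Note that `SequenceMatcher` still compares characters inside each candidate window, which the paper's approach reportedly avoids.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_find(keyword, text, threshold=0.85, max_extra_tokens=1):
    """Find token windows in `text` that approximately match `keyword`.

    Windows of 1..(keyword tokens + max_extra_tokens) adjacent tokens are
    joined without spaces, so "fire wall" can match the keyword "firewall".
    Returns a list of (start_index, width, matched_text, score) tuples.
    """
    kw_tokens = tokenize(keyword)
    target = "".join(kw_tokens)
    tokens = tokenize(text)
    matches = []
    for start in range(len(tokens)):
        for width in range(1, len(kw_tokens) + max_extra_tokens + 1):
            window = tokens[start:start + width]
            if len(window) < width:
                break  # ran past the end of the text
            candidate = "".join(window)
            score = SequenceMatcher(None, target, candidate).ratio()
            if score >= threshold:
                matches.append((start, width, " ".join(window), score))
    return matches
```

Calling `fuzzy_find("firewall", "my fire wall rules are broken")` reports the two-token window `fire wall`, and a misspelling such as `firewal` also clears the assumed threshold. Lowering the threshold raises recall and lowers precision, which is the same kind of trade-off the abstract reports.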

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Topic Modeling