Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity
Nkechi Ifeanyi-Reuben, Chidiebere Ugwu, Nwachukwu E.O

TL;DR
This study compares unigram and bigram text representations for Igbo document similarity, finding that bigram models provide more accurate similarity measures, which can improve Igbo text processing tasks.
Contribution
It introduces a comparative analysis of n-gram models for Igbo text similarity, highlighting the effectiveness of bigram representation over unigram.
Findings
Bigram-represented text yields lower Euclidean distance values than unigram-represented text, which indicates higher measured similarity.
Igbo text similarity is more accurate with bigram representation.
The study demonstrates the effectiveness of bigram models for Igbo text tasks.
Abstract
Improvements in Information Technology have encouraged the use of Igbo in creating online text such as resources and news articles. Text similarity is of great importance in any text-based application. This paper presents a comparative analysis of n-gram text representation on Igbo text document similarity. It adopts the Euclidean similarity measure to determine the similarities between Igbo text documents represented with two word-based n-gram text representation models (unigram and bigram). The evaluation of the similarity measure is based on the adopted text representation models. The model is designed with Object-Oriented Methodology and implemented in the Python programming language with tools from the Natural Language Toolkit (NLTK). The result shows that unigram-represented text has the highest distance values, whereas bigram-represented text has the lowest corresponding distance values. The lower…
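The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library rather than NLTK, assumes plain whitespace tokenization and raw n-gram counts as the document vectors, and the helper names (`word_ngrams`, `euclidean_distance`) and the placeholder documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word-level n-grams of `text` as a list of tuples."""
    # Assumption: lowercase + whitespace split stands in for real Igbo tokenization.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def euclidean_distance(doc_a, doc_b, n):
    """Euclidean distance between the n-gram count vectors of two documents."""
    counts_a = Counter(word_ngrams(doc_a, n))
    counts_b = Counter(word_ngrams(doc_b, n))
    # The shared vector space is the union of n-grams seen in either document.
    vocabulary = set(counts_a) | set(counts_b)
    return sqrt(sum((counts_a[g] - counts_b[g]) ** 2 for g in vocabulary))


# Placeholder token sequences, not real Igbo text.
doc_1 = "w1 w2 w3 w4"
doc_2 = "w1 w2 w3 w5"

unigram_distance = euclidean_distance(doc_1, doc_2, 1)  # n = 1: unigram model
bigram_distance = euclidean_distance(doc_1, doc_2, 2)   # n = 2: bigram model
```

A distance of zero means the two count vectors are identical, and larger values mean the documents share fewer n-grams. This toy input is only meant to show the mechanics; it does not reproduce the distance values reported in the paper.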
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
