Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach
Satyajit Ghosh, Aniruddha Ghosh, Bittaswer Ghosh, and Abhishek Roy

TL;DR
This paper presents an approach for detecting plagiarism in Bengali texts using text similarity measures, specifically Levenshtein distance, and describes a web tool built for practical use despite the limited availability of digitized Bengali literature.
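As a rough illustration of the similarity measure named above, the following is a minimal sketch of Levenshtein distance and a normalized similarity score derived from it. It is not the authors' implementation. The function names, the NFC normalization step, and the normalization by the longer string's length are all assumptions made for this example. The distance is computed over Unicode code points, so a Bengali grapheme made of several code points counts as several characters.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counted over Unicode code points."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; 1.0 means the strings are identical."""
    # Assumption: normalize to NFC so equivalent sequences compare equal.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("আমি বাংলায় গান গাই", "আমি বাংলার গান গাই"))
```

A checker built on this idea would compare each passage of a submitted document against passages in the corpus and flag pairs whose score exceeds a chosen threshold; the threshold is a tuning decision that this summary does not specify.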
Contribution
It introduces a Bengali-specific plagiarism detection method utilizing OCR and text similarity, filling a gap in language-specific tools and creating a new corpus for Bengali literature.
Findings
Achieved 72.10% to 79.89% accuracy in OCR text extraction.
Implemented a web-based plagiarism detection tool for Bengali.
Constructed a Bengali literature corpus from digital sources.
Abstract
Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Although multiple tools are available to detect plagiarism in a document, most of them are domain-specific and designed to work on English texts, yet plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, and most of its books have not yet been properly digitized. As no such corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India and with a…
Taxonomy
Topics: Academic integrity and plagiarism · Text Readability and Simplification
