Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes
Masaki Oguni, Yohei Seki, Yu Hirate

TL;DR
This paper introduces a character 3-gram Mover's Distance method for detecting near-duplicate Japanese recipes in large user-generated recipe datasets. It detects near-duplicates that a comparison method misses.
Contribution
The study extends Word Mover's Distance to character 3-gram embeddings and demonstrates its effectiveness in identifying near-duplicate recipes in a large-scale corpus.
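To make the unit of comparison concrete, here is a minimal sketch of splitting a recipe step into overlapping character 3-grams and turning them into normalized frequency weights, as in the bag-of-words weighting that Word Mover's Distance uses. The function names and the sample sentence are illustrative assumptions, not taken from the paper, and the paper's actual preprocessing (for example, how whitespace or punctuation is handled) is not specified here.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_weights(text, n=3):
    """Map each distinct n-gram to its relative frequency (weights sum to 1)."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical recipe step ("cut the onion"); not from the paper's corpus.
step = "玉ねぎを切る"
print(char_ngrams(step))
print(ngram_weights(step))
```

Working on character 3-grams avoids the need for a word segmenter, which is one plausible reason the approach suits Japanese text, where words are not separated by spaces.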
Findings
Detected near-duplicate recipes that the comparison method missed
Learned word embeddings and character 3-gram embeddings using a Skip-Gram model with negative sampling and fastText
Extracted and annotated candidate near-duplicate pairs from a corpus of over 1.21 million recipes
Abstract
In user-generated recipe websites, users post their original recipes. Some recipes, however, are very similar to other recipes in major components such as the cooking instructions. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the "Word Mover's Distance", which calculates distances between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned the word embeddings and the character 3-gram embeddings by using a Skip-Gram model with negative sampling and fastText to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. Our results demonstrated that near-duplicate recipes that were not detected by the comparison method were successfully detected by the proposed method.
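As a rough illustration of how a mover's distance over character 3-grams can be computed, the sketch below implements the relaxed variant known from the Word Mover's Distance literature: each n-gram sends all of its weight to the nearest n-gram in the other text, and the larger of the two one-sided costs is kept. This is a lower bound on the exact transport-based distance, not the exact distance itself, and it is not claimed to be the authors' implementation. The toy 2-D vectors, the n-gram keys, and all function names are hypothetical; a real setup would use learned embeddings and would need a vector for every n-gram.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Cost of moving each source n-gram's full weight to its nearest target n-gram."""
    return sum(
        weight * min(euclidean(vectors[gram], vectors[other]) for other in dst)
        for gram, weight in src.items()
    )


def relaxed_movers_distance(bag_a, bag_b, vectors):
    """Lower bound on the exact mover's distance between two weighted n-gram bags."""
    return max(
        one_sided_cost(bag_a, bag_b, vectors),
        one_sided_cost(bag_b, bag_a, vectors),
    )


# Toy embeddings for hypothetical 3-grams (assumption: every n-gram has a vector).
vectors = {"abc": (0.0, 0.0), "bcd": (1.0, 0.0), "xyz": (0.0, 2.0)}

# Weighted bags of 3-grams; weights in each bag sum to 1.
bag_a = {"abc": 0.5, "bcd": 0.5}
bag_b = {"abc": 0.5, "xyz": 0.5}

print(relaxed_movers_distance(bag_a, bag_b, vectors))
print(relaxed_movers_distance(bag_a, bag_a, vectors))
```

A small distance suggests that two texts are made of n-grams with nearby embeddings, so a threshold on this value is one plausible way to shortlist candidate near-duplicate pairs before annotation.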
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: fastText
