Creating a contemporary corpus of similes in Serbian by using natural language processing
Nikola Milosevic, Goran Nenadic

TL;DR
This paper presents a semi-automated method that uses text mining and machine learning to collect Serbian similes from the web. Combined with crowdsourcing, it expands an existing corpus to 787 unique similes.
Contribution
It introduces a novel methodology combining text mining, machine learning, and crowdsourcing for building a comprehensive Serbian simile corpus.
Findings
Expanded the Serbian simile corpus to 787 unique entries, starting from the 333 similes collected by Vuk Stefanovic Karadzic.
Demonstrated effectiveness of semi-automated collection methods.
Integrated crowdsourcing to enhance data collection.
Abstract
A simile is a figure of speech that compares two things using connecting words, where the comparison is not intended to be taken literally. Similes are often used in everyday communication, but they are also part of linguistic cultural heritage. In this paper we present a methodology for the semi-automated collection of similes from the World Wide Web using text mining and machine learning techniques. We expanded an existing corpus, collected by Vuk Stefanovic Karadzic and containing 333 similes, with 442 similes collected from the internet. We also introduce crowdsourcing to the collection of figures of speech, which helped us build a corpus containing 787 unique similes.
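As an illustration of the text-mining step the abstract describes, the sketch below shows one way simile candidates could be pulled from raw text and merged into an existing list without duplicates. It is a minimal sketch and not the authors' implementation: the regular expression built around the Serbian comparison word "kao" ("as", "like"), the function names `extract_candidates` and `merge_unique`, and the sample sentences are all assumptions made for this example.

```python
import re

# Assumed pattern: one word, the comparison word "kao", then one word.
# Real similes can be longer; this is deliberately simplified.
SIMILE_PATTERN = re.compile(r"\b(\w+)\s+kao\s+(\w+)", re.IGNORECASE)


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so near-identical entries compare equal."""
    return " ".join(phrase.lower().split())


def extract_candidates(text: str) -> list[str]:
    """Return 'X kao Y' candidates found in text, in order of appearance."""
    return [
        normalize(f"{m.group(1)} kao {m.group(2)}")
        for m in SIMILE_PATTERN.finditer(text)
    ]


def merge_unique(*corpora: list[str]) -> list[str]:
    """Merge several simile lists, keeping the first occurrence of each entry."""
    seen: set[str] = set()
    merged: list[str] = []
    for corpus in corpora:
        for phrase in corpus:
            key = normalize(phrase)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


if __name__ == "__main__":
    web_text = "Bio je hladan kao led, a ona brza kao munja."
    existing = ["Hladan kao led", "crn kao ugalj"]
    print(merge_unique(existing, extract_candidates(web_text)))
```

A pattern like this only produces candidates. "Kao" also appears in literal comparisons, so the matches would still need filtering, for example by a trained classifier or by human review, before entering a corpus.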
Taxonomy
Topics: Natural Language Processing Techniques · Lexicography and Language Studies · Linguistics and Terminology Studies
