New word analogy corpus for exploring embeddings of Czech words
Lukáš Svoboda, Tomáš Brychcín

TL;DR
This paper introduces a new Czech word analogy corpus for evaluating word embedding methods such as Word2Vec and GloVe on a morphologically rich language, providing insight into how well they capture syntactic and semantic properties.
Contribution
The paper presents a new Czech word analogy corpus designed for analyzing a morphologically rich language and evaluates Word2Vec and GloVe on it.
Findings
Word2Vec and GloVe perform differently on Czech analogies.
The corpus reveals challenges in capturing Czech morphology.
Results highlight the need for language-specific embedding evaluation.
Abstract
Word embedding methods have proven very useful in many NLP (Natural Language Processing) tasks. Word embeddings of English words and phrases have been studied extensively, but little attention has been given to other languages. Our goal in this paper is to explore the behavior of state-of-the-art word embedding methods on Czech, a language characterized by very rich morphology. We introduce a new corpus for the word analogy task that inspects syntactic, morphosyntactic, and semantic properties of Czech words and phrases. We experiment with the Word2Vec and GloVe algorithms and discuss the results on this corpus. The corpus is available to the research community.
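Example
The word analogy task mentioned in the abstract is commonly scored with the vector-offset method: for a question "a is to b as c is to ?", the predicted word is the one whose vector is closest, by cosine similarity, to b - a + c. The sketch below illustrates that scoring only. It assumes this common formulation rather than the authors' exact protocol, and the function names and the 2-D toy vectors are hypothetical, invented for illustration; they are not taken from the corpus or from trained Word2Vec or GloVe models.

```python
import numpy as np


def solve_analogy(emb, a, b, c):
    """Return the word whose vector is closest (cosine) to emb[b] - emb[a] + emb[c].

    The three question words are excluded from the candidates, as is usual
    in analogy evaluation.
    """
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical toy vectors (NOT real embeddings): axis 0 ~ "royalty", axis 1 ~ "female".
toy = {
    "král": np.array([1.0, 0.0]),      # king
    "královna": np.array([1.0, 1.0]),  # queen
    "muž": np.array([0.2, 0.0]),       # man
    "žena": np.array([0.2, 1.0]),      # woman
    "pes": np.array([-1.0, 0.1]),      # dog (distractor)
}

# Hypothetical analogy questions in (a, b, c, expected) form.
questions = [
    ("muž", "žena", "král", "královna"),
    ("král", "královna", "muž", "žena"),
]

if __name__ == "__main__":
    print(solve_analogy(toy, "muž", "žena", "král"))
    print(analogy_accuracy(toy, questions))
```

With real embeddings, the dictionary would be replaced by vectors loaded from a trained model, and accuracy would typically be reported per analogy category, so that syntactic, morphosyntactic, and semantic questions can be compared separately.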
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
Methods: GloVe Embeddings
