New word analogy corpus for exploring embeddings of Czech words

Lukáš Svoboda; Tomáš Brychcín

arXiv:1608.00789 · cs.CL · August 3, 2016 · 2 cites

TL;DR

This paper introduces a new Czech word analogy corpus for evaluating word embedding methods such as Word2Vec and GloVe on a morphologically rich language, providing insight into their syntactic and semantic capabilities.

Contribution

The paper presents a novel Czech word analogy corpus tailored to the analysis of a morphologically rich language and evaluates popular embedding methods on this resource.

Findings

01

Word2Vec and GloVe perform differently on Czech analogies.

02

The corpus reveals challenges in capturing Czech morphology.

03

Results highlight the need for language-specific embedding evaluation.

Abstract

Word embedding methods have proven very useful in many NLP (Natural Language Processing) tasks. Word embeddings of English words and phrases have been studied extensively, but little attention has been paid to other languages. Our goal in this paper is to explore the behavior of state-of-the-art word embedding methods on Czech, a language characterized by very rich morphology. We introduce a new corpus for the word analogy task that inspects syntactic, morphosyntactic and semantic properties of Czech words and phrases. We experiment with the Word2Vec and GloVe algorithms and discuss the results on this corpus. The corpus is available to the research community.
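
As a rough illustration of what a word analogy task measures, the sketch below answers questions of the form "a is to b as c is to ?" with vector arithmetic and cosine similarity, which is the commonly used approach for such benchmarks. The abstract does not state the exact scoring rule, so this choice is an assumption. The vocabulary, the hand-made 3-dimensional vectors, and the function names (`normalize`, `solve_analogy`, `analogy_accuracy`) are hypothetical and are not taken from the paper or its corpus.

```python
import numpy as np


def normalize(vectors):
    """Scale each row to unit length so dot products rank by cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


def solve_analogy(a, b, c, vocab, unit_vectors):
    """Return the word d whose vector is closest (by cosine) to b - a + c.

    The three question words are excluded from the candidates, as is
    customary in analogy evaluation (an assumption here).
    """
    index = {word: i for i, word in enumerate(vocab)}
    target = unit_vectors[index[b]] - unit_vectors[index[a]] + unit_vectors[index[c]]
    scores = unit_vectors @ target
    for word in (a, b, c):
        scores[index[word]] = -np.inf
    return vocab[int(np.argmax(scores))]


def analogy_accuracy(questions, vocab, unit_vectors):
    """Fraction of (a, b, c, d) questions where the predicted word equals d."""
    correct = sum(
        solve_analogy(a, b, c, vocab, unit_vectors) == d
        for a, b, c, d in questions
    )
    return correct / len(questions)


# Toy, hand-made vectors (NOT trained embeddings): axis 0 is shared,
# axis 1 loosely encodes "female", axis 2 loosely encodes "royal".
vocab = ["muž", "žena", "král", "královna", "pes"]
unit_vectors = normalize(np.array([
    [1.0, 0.0, 0.0],   # muž (man)
    [1.0, 1.0, 0.0],   # žena (woman)
    [1.0, 0.0, 1.0],   # král (king)
    [1.0, 1.0, 1.0],   # královna (queen)
    [0.0, 0.0, -1.0],  # pes (dog), a distractor
]))

# Illustrative questions only; they are not drawn from the published corpus.
questions = [
    ("muž", "žena", "král", "královna"),
    ("král", "královna", "muž", "žena"),
]

print(solve_analogy("muž", "žena", "král", vocab, unit_vectors))
print(analogy_accuracy(questions, vocab, unit_vectors))
```

With real embeddings, the same loop would run over trained Word2Vec or GloVe vectors and the questions of each corpus category, giving one accuracy figure per category.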

Code & Models

Repositories

Svobikl/cz_corpus (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques

Methods: GloVe Embeddings