More Romanian word embeddings from the RETEROM project
Vasile Păiș, Dan Tufiș

TL;DR
This paper describes the development of several sets of Romanian word embeddings within the ReTeRom project, built with linguistic features such as lemmas and part-of-speech tags to support natural language processing tasks.
Contribution
The paper introduces new sets of Romanian word embeddings built with different features, extending earlier word-occurrence models with lemmas and part-of-speech (POS) tags.
Findings
Existing embeddings based on word occurrences are augmented with lemma and POS features.
The new embeddings are intended to support morphological, syntactic, and semantic analysis.
Graphical tools are developed for exploring the vector representations.
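To make the difference between word-occurrence embeddings and feature-augmented ones concrete, the following minimal sketch shows one way an annotated sentence could be turned into training units before it is passed to an embedding trainer. It is an illustration under stated assumptions, not the authors' pipeline: the `Token` type, the `to_training_units` function, the `lemma_POS` joining convention, and the UD-style tags are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One annotated token (hypothetical structure, not from the paper)."""
    word: str   # surface form as it occurs in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (UD-style tags assumed here)


def to_training_units(sentence: Iterable[Token], mode: str = "word") -> List[str]:
    """Convert an annotated sentence into the units an embedding model would see.

    mode="word"      -> lowercased word occurrences (raw-text style)
    mode="lemma"     -> lowercased lemmas
    mode="lemma_pos" -> lemma joined with its POS tag (assumed convention)
    """
    if mode == "word":
        return [t.word.lower() for t in sentence]
    if mode == "lemma":
        return [t.lemma.lower() for t in sentence]
    if mode == "lemma_pos":
        return [f"{t.lemma.lower()}_{t.pos}" for t in sentence]
    raise ValueError(f"unknown mode: {mode!r}")


# "Copiii citesc cărți" ("The children read books"), annotated by hand.
sentence = [
    Token("Copiii", "copil", "NOUN"),
    Token("citesc", "citi", "VERB"),
    Token("cărți", "carte", "NOUN"),
]

word_units = to_training_units(sentence, "word")
lemma_pos_units = to_training_units(sentence, "lemma_pos")
```

In the `word` mode every inflected form gets its own vector, whereas the `lemma_pos` mode merges inflected forms of the same lemma and keeps homographs with different parts of speech apart, which is the kind of distinction the lemma and POS features are meant to capture.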
Abstract
Automatically learned vector representations of words, also known as "word embeddings", are becoming a basic building block for more and more natural language processing algorithms. There are different ways and tools for constructing word embeddings. Most of the approaches rely on raw texts, the construction items being the word occurrences and/or letter n-grams. More elaborated research is using additional linguistic features extracted after text preprocessing. Morphology is clearly served by vector representations constructed from raw texts and letter n-grams. Syntax and semantics studies may profit more from the vector representations constructed with additional features such as lemma, part-of-speech, syntactic or semantic dependants associated with each word. One of the key objectives of the ReTeRom project is the development of advanced technologies for Romanian natural language…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
