A Comparative Analysis of Static Word Embeddings for Hungarian

Máté Gedeon

arXiv:2505.07809 · cs.CL · September 30, 2025

TL;DR

This study compares various static word embeddings for Hungarian, evaluating traditional models and BERT-based methods on intrinsic and extrinsic tasks, revealing the strengths of FastText and the X2Static extraction approach.

Contribution

It introduces a comprehensive evaluation of static embeddings for Hungarian, including novel extraction methods from BERT-based models, and provides insights into their relative performance.

Findings

01. FastText excels in intrinsic analogy tasks.
02. X2Static extraction improves BERT-based static embeddings.
03. ELMo embeddings perform best in NER and POS tagging.

Abstract

This paper presents a comprehensive analysis of static word embeddings for Hungarian, including traditional models such as Word2Vec and FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings' ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings outperforms the decontextualized and aggregate methods, approaching the effectiveness of traditional static…
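
The analogy metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common vector-offset (3CosAdd) formulation, in which the answer to a : b :: c : ? is the word whose vector is closest in cosine similarity to b - a + c, with the three query words excluded from the candidates. The abstract does not state the paper's exact protocol, so this formulation, the function names `rank_of_answer` and `evaluate`, and the toy three-dimensional vectors are all assumptions made for illustration; the vectors are hand-written and are not taken from any trained model.

```python
import numpy as np


def rank_of_answer(emb, a, b, c, expected):
    """Return the 1-based rank of `expected` for the analogy a : b :: c : ?.

    Assumes the 3CosAdd formulation: candidates are ranked by cosine
    similarity to (b - a + c), and the query words are excluded.
    """
    words = list(emb)
    matrix = np.stack([np.asarray(emb[w], dtype=float) for w in words])
    # Normalise rows so that dot products are cosine similarities.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}

    query = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    scores = matrix @ (query / np.linalg.norm(query))

    # The three query words may not be returned as the answer.
    for w in (a, b, c):
        scores[index[w]] = -np.inf

    ranked = [words[i] for i in np.argsort(-scores)]
    return ranked.index(expected) + 1


def evaluate(emb, questions):
    """Return (accuracy, MRR) over a list of (a, b, c, expected) tuples."""
    ranks = [rank_of_answer(emb, *q) for q in questions]
    accuracy = float(np.mean([r == 1 for r in ranks]))   # top-1 hit rate
    mrr = float(np.mean([1.0 / r for r in ranks]))       # mean reciprocal rank
    return accuracy, mrr


# Hypothetical toy vectors (illustration only, not real embeddings).
toy = {
    "férfi":    [1.0, 0.0, 0.0],   # man
    "nő":       [1.0, 1.0, 0.0],   # woman
    "király":   [1.0, 0.0, 1.0],   # king
    "királynő": [1.0, 1.0, 1.0],   # queen
    "alma":     [0.0, -1.0, 0.0],  # apple (distractor)
}
questions = [
    ("férfi", "nő", "király", "királynő"),
    ("király", "királynő", "férfi", "nő"),
]

accuracy, mrr = evaluate(toy, questions)
print(f"accuracy={accuracy:.2f}  MRR={mrr:.2f}")
```

Accuracy only credits an analogy when the expected word is ranked first, while MRR also gives partial credit (1/rank) when the expected word appears lower in the ranking, which is why the two are typically reported together.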

Code & Models

Repositories

gedeonmate/hungarian_static_embeddings
tf · Official

Models

🤗
gedeonmate/static_hungarian_bert
model

Datasets

gedeonmate/hun_stat_dataset
dataset · 26 dl

Taxonomy

Methods: Softmax · Tanh Activation · Bidirectional LSTM · ELMo · Sigmoid Activation · Long Short-Term Memory