TL;DR
This paper introduces HuSpaCy, an efficient and accurate NLP toolkit for Hungarian that extends spaCy with improved models for core text processing tasks, achieving near state-of-the-art performance.
Contribution
The paper presents new Hungarian NLP models integrated into spaCy, enhancing accuracy and efficiency across multiple text processing tasks.
Findings
High accuracy across all NLP tasks for Hungarian
Competitive performance compared to existing tools
Open-source and reproducible pipelines
Abstract
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsLinear Layer · fastText · Convolution · Residual Connection · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Adam · Dropout
