Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

György Orosz; Gergő Szabó; Péter Berkecz; Zsolt Szántó; Richárd Farkas

arXiv:2308.12635·cs.CL·August 25, 2023

TL;DR

This paper introduces HuSpaCy, an efficient and accurate NLP toolkit for Hungarian that extends spaCy with improved models for core text processing tasks, achieving near state-of-the-art performance.

Contribution

The paper presents new Hungarian NLP models integrated into spaCy, enhancing accuracy and efficiency across multiple text processing tasks.

Findings

01 High accuracy across all basic text processing steps for Hungarian

02 Competitive performance compared to existing Hungarian NLP tools

03 Open-source pipelines and reproducible experiments

Abstract

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under…
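
The processing steps listed in the abstract correspond to standard spaCy annotations, so a short sketch may help show how such a pipeline is typically consumed. This is a minimal illustration rather than code from the paper: the model name `hu_core_news_lg` is an assumption (substitute whichever HuSpaCy model is installed), and `TokenRow`, `rows_to_tsv` and `annotate` are hypothetical helper names introduced here.

```python
from typing import Dict, List, NamedTuple


class TokenRow(NamedTuple):
    """Per-token annotations (hypothetical container, not part of spaCy)."""
    text: str
    lemma: str
    pos: str
    morph: str
    dep: str
    head: str


def rows_to_tsv(rows: List[TokenRow]) -> str:
    """Render token annotations as tab-separated lines, one token per line."""
    return "\n".join("\t".join(row) for row in rows)


def annotate(text: str, model_name: str = "hu_core_news_lg") -> Dict[str, object]:
    """Run a spaCy pipeline over `text` and collect the annotations named in the abstract.

    `model_name` is an assumption; the pipeline must already be installed locally.
    """
    import spacy  # third-party; imported lazily so rows_to_tsv works without it

    nlp = spacy.load(model_name)
    doc = nlp(text)  # tokenization first, then the remaining pipeline components
    rows = [
        TokenRow(t.text, t.lemma_, t.pos_, str(t.morph), t.dep_, t.head.text)
        for t in doc
    ]
    return {
        # sentence-boundary detection
        "sentences": [s.text for s in doc.sents],
        # lemma, part of speech, morphological features, dependency relation and head
        "tokens": rows,
        # named entity recognition
        "entities": [(e.text, e.label_) for e in doc.ents],
    }
```

With a model installed, calling `annotate` on a Hungarian sentence and passing the resulting `"tokens"` list to `rows_to_tsv` prints one token per line with its lemma, part-of-speech tag, morphological features, dependency label and head.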

Taxonomy

Methods: Linear Layer · fastText · Convolution · Residual Connection · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Adam · Dropout