HuSpaCy: an industrial-strength Hungarian natural language processing toolkit
György Orosz, Zsolt Szántó, Péter Berkecz, Gergő Szabó, Richárd Farkas

TL;DR
HuSpaCy is an open-source, industrial-strength Hungarian NLP toolkit built on spaCy, offering fast, accurate, and resource-efficient components for core linguistic analysis tasks suitable for real-world applications.
Contribution
The paper introduces HuSpaCy, an industry-ready Hungarian NLP toolkit that combines close to state-of-the-art linguistic analysis with high efficiency, and that is built on spaCy, a framework supporting multiple languages.
Findings
High accuracy in linguistic tasks
Resource-efficient prediction capabilities
Open-source and easy to use
Abstract
Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; what is more, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, resulting in an easily usable, fast yet accurate application. Experiments confirm that HuSpaCy has…
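Since the toolkit is built on spaCy's components, the analysis layers named in the abstract (lemmatization, morphosyntactic analysis, entity recognition) can be read off a processed document through spaCy's standard token and entity attributes. The following is a minimal sketch, not the authors' code: the helper names `load_pipeline` and `analyze` are hypothetical, and the model name `hu_core_news_lg` is an assumption that does not appear in the text above and would need to be installed separately.

```python
from typing import Any, Dict, List


def load_pipeline(name: str = "hu_core_news_lg") -> Any:
    """Load a spaCy pipeline by name.

    The default name is an assumption (not stated in the source); the
    corresponding model package must already be installed.
    """
    import spacy  # imported lazily so analyze() works with any spaCy-like callable

    return spacy.load(name)


def analyze(nlp: Any, text: str) -> Dict[str, List[tuple]]:
    """Run a spaCy-style pipeline and collect per-token and entity annotations."""
    doc = nlp(text)
    # Standard spaCy token attributes: surface form, lemma, coarse POS tag,
    # morphological features, and dependency relation.
    tokens = [(t.text, t.lemma_, t.pos_, str(t.morph), t.dep_) for t in doc]
    # Named entities recognized in the document, with their labels.
    entities = [(e.text, e.label_) for e in doc.ents]
    return {"tokens": tokens, "entities": entities}
```

With a Hungarian model installed, `analyze(load_pipeline(), text)` would return one tuple per token and one per recognized entity; the actual tags and labels depend on the model, so no output is shown here.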
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Continuous Bag-of-Words Word2Vec · Convolution
