Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers
Frederick Riemenschneider, Kevin Krahn

TL;DR
This paper combines character-aware hierarchical transformers (for PoS and morphological tagging) with character-level T5 models (for lemmatization) to analyze 13 low-resource historical languages, placing first in the constrained subtask of the SIGTYP 2024 shared task.
Contribution
It adapts the hierarchical tokenization method of Sun et al. (2023) and combines it with the DeBERTa-V3 architecture for PoS and morphological tagging, and it applies character-level T5 models to lemmatization. All models are pre-trained from scratch on limited data.
Findings
Achieved first place in the constrained subtask of SIGTYP 2024.
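The models nearly reached the performance of the unconstrained task's winner, despite being pre-trained from scratch on limited data.
Character-aware hierarchical transformers proved effective for PoS and morphological tagging, and character-level T5 models for lemmatization.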
Abstract
Historical languages present unique challenges to the NLP community, with one prominent hurdle being the limited resources available in their closed corpora. This work describes our submission to the constrained subtask of the SIGTYP 2024 shared task, focusing on PoS tagging, morphological tagging, and lemmatization for 13 historical languages. For PoS and morphological tagging we adapt a hierarchical tokenization method from Sun et al. (2023) and combine it with the advantages of the DeBERTa-V3 architecture, enabling our models to efficiently learn from every character in the training data. We also demonstrate the effectiveness of character-level T5 models on the lemmatization task. Pre-trained from scratch with limited data, our models achieved first place in the constrained subtask, nearly reaching the performance levels of the unconstrained task's winner. Our code is available at…
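To illustrate what "learning from every character" can look like in practice, here is a minimal Python sketch of character-aware hierarchical input preparation. It is not the authors' code, nor the implementation of Sun et al. (2023): the special-symbol IDs, the function names, and the fixed maximum word length are all assumptions made for illustration. The idea is that each word becomes its own short sequence of character IDs, so a word-internal encoder can summarize it before a sentence-level encoder contextualizes the word vectors.

```python
from typing import Dict, List

# Hypothetical reserved IDs; the real models may use different symbols.
PAD, WORD_CLS, UNK = 0, 1, 2


def build_char_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign an ID to every character seen in the training data."""
    vocab: Dict[str, int] = {}
    for sentence in sentences:
        for word in sentence:
            for ch in word:
                # IDs 0-2 are reserved for PAD, WORD_CLS and UNK.
                vocab.setdefault(ch, len(vocab) + 3)
    return vocab


def encode_sentence(
    words: List[str], vocab: Dict[str, int], max_word_len: int
) -> List[List[int]]:
    """Encode a sentence as one fixed-length row of character IDs per word.

    Each row starts with WORD_CLS, whose position a word-internal
    encoder could use to pool the characters into a word vector.
    """
    rows = []
    for word in words:
        ids = [WORD_CLS] + [vocab.get(ch, UNK) for ch in word]
        ids = ids[:max_word_len]                  # truncate long words
        ids += [PAD] * (max_word_len - len(ids))  # pad short words
        rows.append(ids)
    return rows


train = [["arma", "virumque", "cano"]]
vocab = build_char_vocab(train)
encoded = encode_sentence(["arma", "cano"], vocab, max_word_len=6)
```

Because the vocabulary consists of characters rather than subwords, no training character is lost to an out-of-vocabulary bucket, which is one plausible reason such a scheme suits small closed corpora.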
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Layer Normalization · SentencePiece · Attention Dropout · Linear Layer · Residual Connection · Multi-Head Attention
