YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text
Akindele Michael Olawole, Jesujoba O. Alabi, Aderonke Busayo Sakpere, David I. Adelani

TL;DR
This paper introduces YAD, a new Yorùbá diacritization benchmark, and demonstrates that a specialized T5 model trained on Yorùbá data outperforms multilingual models, with larger datasets and models improving accuracy.
Contribution
The paper presents a dedicated Yorùbá diacritization benchmark dataset and pre-trains a T5 model specifically for Yorùbá, which outperforms several multilingually trained T5 models.
Findings
YAD dataset enables effective evaluation of Yorùbá diacritization.
Yorùbá-specific T5 model outperforms multilingual T5 models.
Larger datasets and models lead to better diacritization results.
Abstract
In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that it outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models yield better diacritization for Yorùbá.
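To make the task concrete: diacritization as a text-to-text problem maps undiacritized input to fully diacritized output. The sketch below is illustrative only and is not taken from the paper. It assumes that training pairs can be derived by stripping tone marks and underdots from diacritized text using Unicode decomposition; the helper names `strip_diacritics` and `make_pair` are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks, underdots) from text.

    Assumption: the undiacritized input is the diacritized text with all
    Unicode combining characters removed.
    """
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Build a (source, target) pair for a text-to-text model."""
    return strip_diacritics(diacritized), diacritized


source, target = make_pair("Yorùbá")
print(source, "->", target)
```

A text-to-text model such as T5 would then be trained to generate `target` given `source`.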
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Gated Linear Unit · Byte Pair Encoding · Linear Layer · SentencePiece · Dropout · Softmax · Attention Is All You Need · Dense Connections · Inverse Square Root Schedule
