YAD: Leveraging T5 for Improved Automatic Diacritization of Yorùbá Text

Akindele Michael Olawole; Jesujoba O. Alabi; Aderonke Busayo Sakpere; David I. Adelani

arXiv:2412.20218·cs.CL·December 31, 2024

TL;DR

This paper introduces YAD, a new Yorùbá diacritization benchmark, and demonstrates that a specialized T5 model trained on Yorùbá data outperforms multilingual models, with larger datasets and models improving accuracy.

Contribution

The paper presents a dedicated Yorùbá diacritization dataset and fine-tunes a T5 model specifically for Yorùbá, achieving superior performance over multilingual models.

Findings

01. The YAD dataset enables effective evaluation of Yorùbá diacritization.

02. A Yorùbá-specific T5 model outperforms multilingual T5 models.

03. Larger datasets and larger models lead to better diacritization results.

Abstract

In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and larger models are better at diacritization for Yorùbá.
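To make the task concrete: diacritization is usually framed as mapping undiacritized text back to its fully marked form, so training and evaluation pairs can be derived from correctly diacritized text by stripping the marks. The sketch below illustrates that idea with Python's standard `unicodedata` module. The function names `strip_diacritics` and `make_pair` are hypothetical, and removing every combining mark (tone marks and underdots alike) is an assumption; the abstract does not say how the YAD pairs were actually built.

```python
import unicodedata
from typing import Tuple


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and underdots) from text.

    Assumption: every combining mark is treated as a diacritic to restore.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Characters with a nonzero canonical combining class are the marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Return a hypothetical (source, target) pair for a text-to-text model."""
    target = unicodedata.normalize("NFC", diacritized)
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_pair("Yorùbá")
    print(source, "->", target)
```

A text-to-text model such as T5 would then be trained to produce the second element of each pair from the first. Normalizing both sides to NFC keeps visually identical strings from being counted as mismatches during evaluation.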

Code & Models

Repositories

ajesujoba/YAD (official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Gated Linear Unit · Byte Pair Encoding · Linear Layer · SentencePiece · Dropout · Softmax · Attention Is All You Need · Dense Connections · Inverse Square Root Schedule