Lesan -- Machine Translation for Low Resource Languages
Asmelash Teka Hadgu, Abel Aregawi, Adam Beaudoin

TL;DR
Lesan is a machine translation system for low-resource languages. It combines data collection from online and offline sources, a custom OCR system for Ethiopic, automatic alignment, and sequence-to-sequence models, and it outperforms Google Translate and Microsoft Translator in human evaluations, improving web access for underserved language speakers.
Contribution
Lesan introduces a pipeline combining online/offline data sources, OCR, and automatic alignment with Transformer-based models for low-resource language translation.
Findings
Lesan outperforms Google Translate and Microsoft Translator in human evaluations.
Supports translation for Tigrinya, Amharic, and English.
Has served over 10 million translations to date.
Abstract
Millions of people around the world cannot access content on the Web because most of it is not readily available in their language. Machine translation (MT) systems have the potential to change this for many languages. Current MT systems provide very accurate results for high resource language pairs, e.g., German and English. However, for many low resource languages, MT is still under active research. The key challenge is the lack of datasets to build these systems. We present Lesan, an MT system for low resource languages. Our pipeline solves the key bottleneck to low resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic, and an automatic alignment module. The final step in the pipeline is a sequence-to-sequence model that takes a parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer…
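To make the alignment step of the pipeline concrete, here is a minimal sketch of turning sentence-split document pairs into a parallel corpus. It is not Lesan's alignment module, whose details the abstract does not give. The `Document` type, the function names, the `max_ratio` threshold, and the length-ratio heuristic are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Document:
    """Hypothetical container: a text and its candidate translation, already sentence-split."""
    source_sentences: List[str]
    target_sentences: List[str]


def align_by_length(doc: Document, max_ratio: float = 2.0) -> List[Tuple[str, str]]:
    """Pair sentences position by position and keep a pair only if the longer
    sentence is at most max_ratio times the shorter (in characters).
    This is a stand-in heuristic, not the paper's alignment method."""
    pairs = []
    for src, tgt in zip(doc.source_sentences, doc.target_sentences):
        if not src or not tgt:
            continue  # skip empty sentences, e.g. OCR dropouts
        longer = max(len(src), len(tgt))
        shorter = min(len(src), len(tgt))
        if longer / shorter <= max_ratio:
            pairs.append((src, tgt))
    return pairs


def build_parallel_corpus(docs: List[Document]) -> List[Tuple[str, str]]:
    """Concatenate the aligned pairs from every document into one corpus."""
    corpus = []
    for doc in docs:
        corpus.extend(align_by_length(doc))
    return corpus
```

The resulting list of sentence pairs is the kind of parallel corpus a sequence-to-sequence model would take as training input. A production aligner would also need to handle one-to-many sentence matches and noisy OCR output, which this sketch ignores.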
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Layer Normalization · Dropout · Label Smoothing · Byte Pair Encoding · Softmax
