Open foundation models for Azerbaijani language

Jafar Isbarov; Kavsar Huseynova; Elvin Mammadov; Mammad Hajili; Duygu Ataman

arXiv:2407.02337 · cs.CL · August 21, 2024

Open Access · 4 Models · 4 Datasets

TL;DR

This paper introduces open-source Azerbaijani language models, a large text corpus, evaluation datasets, and comprehensive benchmarking to advance Azerbaijani NLP technology.

Contribution

The paper provides the first systematic benchmark and supporting resources for open Azerbaijani foundation models, including a text corpus, labeled evaluation datasets, and an evaluation of open-source models with Azerbaijani support.

Findings

01 Open-source models show competitive performance on Azerbaijani tasks.

02 The new corpus improves model training and evaluation.

03 Benchmark results highlight strengths and gaps in current models.

Abstract

The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support.

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam