GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

Omer Nacar; Anis Koubaa; Serry Sibaee; Yasser Al-Habashi; Adel Ammar; Wadii Boulila

arXiv:2505.24581·cs.CL·June 2, 2025

Open Access · 3 Models

TL;DR

This paper presents GATE, an Arabic text embedding model trained with Matryoshka Representation Learning and a hybrid loss, which achieves state-of-the-art results on the Semantic Textual Similarity task within the MTEB benchmark.

Contribution

Introduces GATE, the first model to combine Matryoshka Representation Learning with hybrid loss training for Arabic semantic similarity, surpassing existing models in performance.

Findings

01. GATE achieves a 20-25% performance improvement on STS benchmarks.

02. GATE outperforms larger models, including OpenAI's, on Arabic semantic similarity tasks.

03. GATE effectively captures the semantic nuances of Arabic.

Abstract

Semantic textual similarity (STS) is a critical task in natural language processing (NLP), enabling applications in retrieval, clustering, and understanding semantic relationships between texts. However, research in this area for the Arabic language remains limited due to the lack of high-quality datasets and pre-trained models. This scarcity of resources has restricted the accurate evaluation and advancement of semantic similarity in Arabic text. This paper introduces General Arabic Text Embedding (GATE) models that achieve state-of-the-art performance on the Semantic Textual Similarity task within the MTEB benchmark. GATE leverages Matryoshka Representation Learning and a hybrid loss training approach with Arabic triplet datasets for Natural Language Inference, which are essential for enhancing model performance in tasks that demand fine-grained semantic understanding. GATE outperforms…
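
To make the idea of Matryoshka Representation Learning on triplet data more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the GATE training code: the margin-based triplet loss, the nested dimensions (2, 4, 8), the equal weighting across dimensions, and the toy vectors are all hypothetical choices, and the paper's actual hybrid loss is not reproduced here. The sketch only shows the core mechanism: the same loss is evaluated on nested prefixes of an embedding, so that shorter prefixes stay useful on their own.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(a * b for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: the positive should beat the negative by `margin`.

    Assumption: a simple margin-based triplet loss stands in for the
    paper's loss, which is not specified in this abstract.
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def matryoshka_triplet_loss(anchor, positive, negative, dims, margin=0.5):
    """Average the triplet loss over nested prefixes of the embedding.

    Assumption: every prefix length in `dims` gets equal weight.
    """
    total = 0.0
    for d in dims:
        a = truncate_and_normalize(anchor, d)
        p = truncate_and_normalize(positive, d)
        n = truncate_and_normalize(negative, d)
        total += triplet_loss(a, p, n, margin)
    return total / len(dims)


# Toy 8-dimensional "embeddings" (made-up numbers, not model outputs).
anchor = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.1]
positive = [0.8, 0.2, 0.4, 0.1, 0.1, 0.2, 0.1, 0.0]
negative = [0.1, 0.9, 0.0, 0.5, 0.0, 0.3, 0.4, 0.2]

loss = matryoshka_triplet_loss(anchor, positive, negative, dims=[2, 4, 8])
print(f"Matryoshka triplet loss over dims 2/4/8: {loss:.4f}")
```

At inference time the same `truncate_and_normalize` step is what lets a caller trade accuracy for storage: a shorter prefix of the embedding can be compared with cosine similarity without re-encoding the text.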

Taxonomy

Topics: Topic Modeling