Hakim: Farsi Text Embedding Model

Mehran Sarmadi; Morteza Alikhani; Erfan Zinvandi; Zahra Pourbahman

arXiv:2505.08435 · cs.CL · October 10, 2025

TL;DR

Hakim is a Persian text embedding model that improves on existing approaches by 8.5% on the FaMTEB benchmark. It targets chatbot and retrieval-augmented generation (RAG) applications and is accompanied by three new training datasets.

Contribution

Introduces Hakim, a state-of-the-art Persian text embedding model, together with three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) and a new BERT-based baseline model.

Findings

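01. Achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark.

02. Outperforms all previously developed Persian language models.

03. Designed for chatbot and retrieval-augmented generation (RAG) applications, including retrieval tasks that incorporate message history.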

Abstract

Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly addressing retrieval tasks that require incorporating message history within these systems. We also propose a new baseline model built on the BERT architecture. Our…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Attention Dropout · Softmax · Residual Connection · WordPiece · Linear Layer