Hakim: Farsi Text Embedding Model
Mehran Sarmadi, Morteza Alikhani, Erfan Zinvandi, Zahra Pourbahman

TL;DR
Hakim is a new Persian text embedding model that outperforms previous models by 8.5% on the FaMTEB benchmark. It targets applications such as chatbots and retrieval systems, and is released alongside three new training datasets.
Contribution
Introduces Hakim, a state-of-the-art Persian text embedding model with new datasets and a BERT-based baseline, advancing Persian NLP capabilities.
Findings
Achieves an 8.5% performance improvement on the FaMTEB benchmark.
Outperforms previous Persian language models.
Effective for retrieval and chatbot applications.
Abstract
Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves an 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets (Corpesia, Pairsia-sup, and Pairsia-unsup) to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly addressing retrieval tasks that require incorporating message history within these systems. We also propose a new baseline model built on the BERT architecture. Our…
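The abstract mentions retrieval tasks that incorporate message history but does not say how Hakim does this. The following is a minimal sketch of one common approach, under stated assumptions: the chat history is concatenated with the current message before embedding, and documents are ranked by cosine similarity. The function `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `retrieve` is an illustrative helper; neither comes from the paper.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(history, message, documents, top_k=1):
    """Rank documents against the history-augmented query (an assumption,
    not the paper's documented method)."""
    query = " ".join(history + [message])
    q = toy_embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:top_k]


documents = [
    "visa processing for tehran takes two weeks",
    "how long should bread dough rise",
]
history = ["which visa do i need for tehran"]
message = "how long does it take"

# With history, the follow-up question is resolved against the visa topic.
with_history = retrieve(history, message, documents)
# Without history, the ambiguous message matches on surface words only.
without_history = retrieve([], message, documents)
print(with_history, without_history)
```

In this toy setup the follow-up message alone is ambiguous, so the history changes which document ranks first. A real system would replace `toy_embed` with a trained embedding model.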
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Attention Dropout · Softmax · Residual Connection · WordPiece · Linear Layer
