Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
Sara Bourbour Hosseinbeigi, Sina Asghari, Mohammad Ali Seif Kashani, Mohammad Hossein Shalchian, Mohammad Amin Abbasi

TL;DR
This paper advances Persian Retrieval-Augmented Generation by developing language-specific models, establishing comprehensive benchmarks, and exploring optimization techniques to improve NLP tasks in low-resource language settings.
Contribution
Introduces Persian-specific models and a benchmarking framework, demonstrating improved retrieval and generation accuracy for low-resource Persian NLP applications.
Findings
MatinaSRoberta outperforms previous embeddings in relevance and accuracy
Larger models like Llama-3.1 achieve highest generation accuracy
Optimization techniques enhance RAG system performance
Abstract
This paper examines the specific obstacles of constructing Retrieval-Augmented Generation (RAG) systems in low-resource languages, with a focus on Persian's complicated morphology and versatile syntax. The research aims to improve retrieval and generation accuracy by introducing Persian-specific models, namely MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. Three datasets, covering general knowledge (PQuad), scientifically specialized texts, and organizational reports, were used to assess these models after they were trained on a varied corpus of 73.11 billion Persian tokens. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluations using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that…
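The retrieval step that a sentence-embedding model such as MatinaSRoberta supports in a RAG pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity`, `retrieve`, and `build_prompt` are hypothetical, the embedding vectors are assumed to come from some sentence encoder that is not shown here, and cosine similarity is assumed as the ranking score.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    passage_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank passages by similarity to the query and keep the best top_k."""
    scored = [
        (passage_id, cosine_similarity(query_vec, vec))
        for passage_id, vec in passage_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Place the retrieved passages ahead of the question for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In a full system, the vectors passed to `retrieve` would be produced by the embedding model, and the string returned by `build_prompt` would be sent to the generator model.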
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Layer Normalization · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Adam · Residual Connection · Dropout
