# Improved Arabic query expansion using word embedding

**Authors:** Yaser A. Al-Lahham, Sattam Almatarneh, Kaznah Alshammari, Mutasem Al-Smadi

PMC · DOI: 10.1038/s41598-025-28758-0 · Scientific Reports · 2026-01-22

## TL;DR

This paper trains word embedding models for Arabic query expansion on a representative subset of the dataset, cutting training time to about 10% of full-dataset training while preserving retrieval efficiency.

## Contribution

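The paper proposes training Arabic word embeddings on a representative subset of a dataset and defines the conditions under which a subset counts as representative. Training on such a subset takes about 10% of the full-dataset training time without sacrificing retrieval efficiency.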

## Key findings

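- Training on a subset of words takes about 10% of the full-dataset training time (a 90% reduction) while preserving retrieval efficiency.
- The proposed method outperforms standard PRFQE by at least 7% in MAP and 14.5% in P10, and improves on retrieval without query expansion by 21.7% in MAP and 21.32% in P10.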
- There are no significant differences between using different word embedding models for Arabic query expansion.
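
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how precision at rank k (P10 when k = 10), average precision, and MAP are conventionally computed from ranked result lists. The function names and the toy document IDs are illustrative assumptions and do not come from the paper, whose evaluation tooling is not described in this summary.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the top-k returned documents that are relevant.

    Divides by k even when fewer than k documents were returned,
    which is the usual convention for P@k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Average of per-query average precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy data (hypothetical document IDs, not results from the paper).
toy_ranked = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3"}
print(precision_at_k(toy_ranked, toy_relevant, k=2))
print(average_precision(toy_ranked, toy_relevant))
```

MAP rewards placing relevant documents early across the whole ranking, whereas P10 only looks at the first ten results, which is why the two metrics can improve by different amounts for the same method.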

## Abstract

Word embedding enhances pseudo-relevance feedback query expansion (PRFQE), but training word embedding models on large datasets takes a long time. Moreover, because Arabic has rich morphology, dialectal variation, and few high-quality linguistic resources, training embedding models for it needs special processing. This paper proposes using a representative subset of a dataset to train such models and defines the conditions of representativeness. Using a suitable subset of words to train a word embedding model is effective since it dramatically decreases the training time while preserving the retrieval efficiency. This paper shows that training on a subset of words derived from an Arabic dataset takes 10% of the training time of the whole dataset and preserves retrieval efficiency. The trained models are used to embed words for different scenarios of Arabic query expansion, and the proposed training method is effective: it outperforms ordinary PRFQE by at least 7% in Mean Average Precision (MAP) and by 14.5% in precision at the 10th returned document (P10). Moreover, the improvement over not using query expansion is 21.7% for MAP and 21.32% for P10. The results show no significant differences between using different word embedding models for Arabic query expansion.
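
To make the idea of embedding-assisted pseudo-relevance feedback concrete, here is a minimal sketch under stated assumptions. The scoring rule (cosine similarity between each candidate term and the centroid of the query-term vectors), the function name `expand_query`, and the toy two-dimensional embeddings with Latin placeholder tokens are all hypothetical; this summary does not describe the paper's actual expansion scenarios or term-selection formula.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand_query(
    query_terms: List[str],
    feedback_docs: List[List[str]],
    embeddings: Dict[str, Sequence[float]],
    n_terms: int = 2,
) -> List[str]:
    """Append the n_terms feedback terms closest to the query centroid.

    feedback_docs are the tokenized top-ranked documents from an initial
    retrieval run, treated as relevant without human judgment.
    """
    query_vectors = [embeddings[t] for t in query_terms if t in embeddings]
    if not query_vectors:
        return list(query_terms)  # nothing to compare against
    query_centroid = centroid(query_vectors)
    candidates = {
        term
        for doc in feedback_docs
        for term in doc
        if term in embeddings and term not in query_terms
    }
    # Sort by descending similarity; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-cosine(embeddings[t], query_centroid), t))
    return list(query_terms) + ranked[:n_terms]


# Toy embeddings and feedback documents (placeholders, not data from the paper).
toy_embeddings = {
    "car": [1.0, 0.0],
    "auto": [0.9, 0.1],
    "vehicle": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
toy_feedback = [["auto", "banana"], ["vehicle", "car"]]
print(expand_query(["car"], toy_feedback, toy_embeddings, n_terms=2))
```

In this sketch the embedding model only filters which feedback terms are added, so a model trained on a representative subset can be swapped in for one trained on the full dataset without changing the retrieval pipeline.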

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12830740/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12830740/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/PMC12830740/full.md

---
Source: https://tomesphere.com/paper/PMC12830740