# An 81-million-word multi-genre corpus of Arabic books

**Authors:** Andreas Hallberg

PMC · DOI: 10.1016/j.dib.2025.111456 · Data in Brief · 2025-03-09

## TL;DR

This paper introduces a large Arabic corpus of 81 million words from books across multiple genres, available for linguistic research and language model training.

## Contribution

The paper presents a new, freely available multi-genre Arabic corpus with extensive metadata for linguistic and NLP research.

## Key findings

- The corpus includes 1,745 books in various genres, totaling 81.5 million words.
- Metadata such as author, genre, and publication dates were collected for each book.
- The corpus is suitable for studying linguistic variation and training language models.

## Abstract

This article describes The Arabic E-Book Corpus, a freely available Arabic corpus consisting of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books are of various genres, including fiction and non-fiction, children's literature, plays, and poetry. Most of the texts are editions of works originally published in the 20th century, but the corpus also includes editions of older historical works. Books were retrieved in EPUB format and converted to plain text and HTML. Only books published under unrestricted licenses are included. Extensive metadata (title, author, genre, publication date, original publication date, original language, etc.) were collected from colophons and the publisher's website. The corpus was originally collected in order to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of texts published under free licenses for training language models.
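
The motivating question in the abstract, how vowel-diacritic use varies across genres, can be illustrated with a minimal sketch. This is not the author's method: the function name `diacritic_rate`, the choice of marks to count (U+064B through U+0652), and the letter range are assumptions made here for illustration, and the corpus's actual file layout is not assumed.

```python
# Hypothetical helper: diacritics per Arabic letter in a plain-text string.
# Assumption: "vowel diacritics" is approximated by the tashkil marks
# U+064B..U+0652 (fathatan through sukun); the paper may define this differently.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

# Assumption: letters are the basic Arabic block U+0621..U+064A,
# excluding tatweel (U+0640), which is an elongation mark, not a letter.
LETTERS = {chr(cp) for cp in range(0x0621, 0x064B)} - {"\u0640"}


def diacritic_rate(text: str) -> float:
    """Return the number of diacritics per Arabic letter (0.0 if no letters)."""
    n_letters = sum(ch in LETTERS for ch in text)
    n_marks = sum(ch in DIACRITICS for ch in text)
    return n_marks / n_letters if n_letters else 0.0


# Fully vocalized "kataba": kaf, ta, ba, each followed by a fatha (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
# The same word with no diacritics.
bare = "\u0643\u062A\u0628"

print(diacritic_rate(vocalized))  # → 1.0
print(diacritic_rate(bare))       # → 0.0
```

Applied per book and grouped by the genre field in the metadata, a rate like this would give one rough way to compare diacritization across genres.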

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11981761/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11981761/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/PMC11981761/full.md

---
Source: https://tomesphere.com/paper/PMC11981761