Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, and Safwan AlModhayan

TL;DR
This paper introduces Wasm, a pipeline for creating structured Arabic multimodal corpora from web data, enabling improved pre-training for Arabic language and multimodal models by preserving document structure.
Contribution
The paper presents a novel pipeline for processing Arabic web data into structured multimodal datasets, addressing the lack of high-quality Arabic corpora with preserved document structure.
Findings
Wasm effectively preserves document structure in Arabic multimodal datasets.
The pipeline enables flexible pre-training scenarios for Arabic multimodal models.
Public release of the dataset and pipeline supports future research.
Abstract
The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents, where images and text are interleaved, outperform those trained only on image-text pairs across a wide range of benchmarks. Such approaches leverage advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline, Wasm, for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content…
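To make the idea of structure-preserving markdown output concrete, here is a minimal sketch of how an HTML fragment might be turned into markdown while keeping headings, paragraphs, and images in their original order. It is not the Wasm pipeline itself: the class `MarkdownSketch` and the function `html_to_markdown` are hypothetical names introduced for illustration, only Python's standard `html.parser` module is used, and a real pipeline over Common Crawl would also need filtering, deduplication, and much broader tag coverage.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy converter (hypothetical): keeps headings, paragraphs, and images in document order."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
    BLOCKS = {"p", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished markdown blocks, in document order
        self.buffer = []   # text pieces of the block currently being read
        self.prefix = ""   # heading marker for the current block, if any

    def _flush(self):
        # Collapse whitespace and emit the buffered text as one markdown block.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(f"{self.prefix} {text}" if self.prefix else text)
        self.buffer = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._flush()
            self.prefix = self.HEADINGS.get(tag, "")
        elif tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src") or ""
            alt = attributes.get("alt") or ""
            if src:
                # Keep the image at the position where it appeared in the text.
                self.buffer.append(f" ![{alt}]({src}) ")

    def handle_endtag(self, tag):
        if tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to markdown blocks separated by blank lines."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text or image outside a block element
    return "\n\n".join(parser.blocks)
```

Because the image reference is emitted where the `<img>` tag occurs, the resulting markdown keeps text and images interleaved, which is the property the abstract highlights; text-only extraction would drop that positional information.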
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
