A Preliminary Study for Building an Arabic Corpus of Pair Questions-Texts from the Web: AQA-Webcorp
Wided Bakari (1), Patrice Bellot (1), Mahmoud Neji ((1) LSIS)

TL;DR
This paper presents an initial effort to build an Arabic question-answer corpus from web data, involving web scraping, text cleaning, and preliminary validation to support NLP applications.
Contribution
It introduces a novel method for extracting and constructing an Arabic question-answer corpus from web sources, including a custom JavaScript tool and initial experimental results.
Findings
Developed a JavaScript tool for web page extraction based on queries.
Created a preliminary Arabic question-answer corpus from web data.
Presented initial validation results of the corpus quality.
Abstract
With the development of electronic media and the heterogeneity of Arabic data on the Web, building a clean corpus for natural language processing applications such as machine translation, information retrieval, and question answering has become increasingly pressing. In this manuscript, we seek to create and develop our own corpus of question-text pairs, which will provide a better basis for our experiments. We model the construction of this corpus with a method for Arabic that recovers texts from the web that could prove to be answers to our factual questions. To do this, we developed a java script that extracts a list of HTML pages for a given query. We then clean these pages to obtain a database of texts and a corpus of question-text pairs. In addition, we give preliminary results of our proposed method. Some…
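The cleaning and pairing steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: it is written in Python rather than JavaScript, it assumes the HTML pages for a query have already been retrieved, and the names `clean_html` and `build_pairs` are hypothetical helpers introduced only for this illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def clean_html(html):
    """Hypothetical helper: reduce one HTML page to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_pairs(question, html_pages):
    """Hypothetical helper: pair a question with each non-empty cleaned page."""
    pairs = []
    for html in html_pages:
        text = clean_html(html)
        if text:
            pairs.append({"question": question, "text": text})
    return pairs


if __name__ == "__main__":
    # Assumed input: pages already fetched for the query (no network access here).
    pages = [
        "<html><head><style>p{color:red}</style></head>"
        "<body><p>القاهرة هي عاصمة مصر.</p></body></html>",
    ]
    for pair in build_pairs("ما هي عاصمة مصر؟", pages):
        print(pair["question"], "->", pair["text"])
```

A real pipeline would also need a retrieval step that turns a question into a query and fetches candidate pages, plus Arabic-specific normalization; both are left out here because the abstract does not specify them.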
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques
