Guidelines and Annotation Framework for Arabic Author Profiling
Wajdi Zaghouani, Anis Charfi

TL;DR
This paper introduces an annotation framework and guidelines for creating a large, high-quality Arabic author profiling dataset from social media, covering multiple dialects and regions.
Contribution
It provides a comprehensive annotation pipeline, dialect-specific guidelines, and quality control methods for Arabic author profiling data collection.
Findings
Achieved high inter-annotator agreement
Created a dataset covering 16 Arabic countries and 11 dialectal regions
Identified key challenges in annotating Arabic dialects
Abstract
In this paper, we present the annotation pipeline and the guidelines we wrote as part of an effort to create a large manually annotated Arabic author profiling dataset from various social media sources covering 16 Arabic countries and 11 dialectal regions. The target size of the annotated ARAP-Tweet corpus is more than 2.4 million words. We illustrate and summarize our general and dialect-specific guidelines for each of the selected dialectal regions. We also present the annotation framework and logistics. We monitor annotation quality by regularly computing the inter-annotator agreement during the annotation process. Finally, we describe the issues encountered during the annotation phase, especially those related to the peculiarities of Arabic dialectal varieties as used in social media.
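The abstract does not say which agreement statistic is computed. As an illustration only, the sketch below assumes Cohen's kappa for two annotators who label the same items; the function name `cohen_kappa` and the dialect labels in the example are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the paper does not name its agreement metric; kappa is
    used here only as a common choice for two-annotator agreement.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Both annotators used one identical label everywhere: treat as full agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical dialect labels assigned to four user profiles.
annotator_1 = ["Gulf", "Gulf", "Egyptian", "Levantine"]
annotator_2 = ["Gulf", "Egyptian", "Egyptian", "Levantine"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Recomputing such a score on each batch of doubly annotated profiles is one way to track agreement during annotation rather than only at the end.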
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling
