Guidelines and Annotation Framework for Arabic Author Profiling

Wajdi Zaghouani; Anis Charfi

arXiv:1808.07678·cs.CL·August 24, 2018·6 cites

TL;DR

This paper introduces an annotation framework and guidelines for creating a large, high-quality Arabic author profiling dataset from social media, covering multiple dialects and regions.

Contribution

It provides a comprehensive annotation pipeline, dialect-specific guidelines, and quality control methods for Arabic author profiling data collection.

Findings

01. Achieved high inter-annotator agreement

02. Created a dataset covering 16 Arabic countries and 11 dialectal regions

03. Identified key challenges in annotating Arabic dialects

Abstract

In this paper, we present the annotation pipeline and the guidelines we wrote as part of an effort to create a large manually annotated Arabic author profiling dataset from various social media sources covering 16 Arabic countries and 11 dialectal regions. The target size of the annotated ARAP-Tweet corpus is more than 2.4 million words. We illustrate and summarize our general and dialect-specific guidelines for each of the dialectal regions selected. We also present the annotation framework and logistics. We control the annotation quality frequently by computing the inter-annotator agreement during the annotation process. Finally, we describe the issues encountered during the annotation phase, especially those related to the peculiarities of Arabic dialectal varieties as used in social media.
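The abstract mentions computing inter-annotator agreement during annotation but does not name the metric here. As a minimal sketch, assuming two annotators and Cohen's kappa (one common choice, not necessarily the one the authors used), the check could look like this. The function name `cohen_kappa` and the gender labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: the agreement metric is kappa; the source page does not
    specify which metric the authors computed.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates,
    # summed over every label either annotator used.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in all_labels) / (n * n)

    # Both annotators used a single identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four user profiles (illustrative data only).
annotator_1 = ["male", "male", "female", "female"]
annotator_2 = ["male", "female", "female", "female"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Running such a check on a shared subset of profiles at regular intervals would give an early signal when annotators start to diverge, for example on borderline dialect cases.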

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Topic Modeling