DariMis: Harm-Aware Modeling for Dari Misinformation Detection on YouTube

Jawid Ahmad Baktash, Mosa Ebrahimi, Mohammad Zarif Joya, and Mursal Dawodi

arXiv:2603.22977 · cs.CL · March 25, 2026

Open Access

TL;DR

This paper introduces DariMis, a manually annotated dataset of 9,224 Dari-language YouTube videos, and proposes a pair-input encoding method that improves misinformation detection by modeling the semantic relationship between each video's title and description.

Contribution

It presents the first Dari misinformation dataset, analyzes the coupling of misinformation and harm levels, and introduces a pair-input encoding strategy that enhances detection performance.

Findings

01. 55.9 percent of Misinformation videos carry at least Medium harm potential, compared with 1.0 percent of True videos.

02. Pair-input encoding improves misinformation recall by 7 percentage points.

03. ParsBERT outperforms XLM-RoBERTa-base on Dari misinformation detection.
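The first finding is a conditional rate: the share of videos at Medium or High harm within each Information Type. A minimal sketch of that calculation follows. The label pairs in `toy_labels` are hypothetical placeholders, not DariMis data, and `elevated_harm_rate` is an illustrative helper name rather than anything from the paper.

```python
from collections import Counter

# Hypothetical toy annotations as (information_type, harm_level) pairs.
# These are made-up placeholders and do NOT come from the DariMis dataset.
toy_labels = [
    ("Misinformation", "High"),
    ("Misinformation", "Medium"),
    ("Misinformation", "Low"),
    ("Partly True", "Medium"),
    ("Partly True", "Low"),
    ("True", "Low"),
    ("True", "Low"),
    ("True", "Low"),
]

# Harm levels counted as "at least Medium".
ELEVATED = {"Medium", "High"}


def elevated_harm_rate(labels):
    """Return the fraction of items with Medium or High harm per information type."""
    totals = Counter()
    elevated = Counter()
    for info_type, harm in labels:
        totals[info_type] += 1
        if harm in ELEVATED:
            elevated[info_type] += 1
    return {t: elevated[t] / totals[t] for t in totals}


rates = elevated_harm_rate(toy_labels)
for info_type, rate in rates.items():
    print(f"{info_type}: {rate:.1%} at Medium or High harm")
```

Applied to the real annotations, the same calculation would give the 55.9 percent and 1.0 percent figures reported for Misinformation and True content. A large gap between those rates is what would let an Information Type prediction double as a rough harm-triage signal.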

Abstract

Dari, the primary language of Afghanistan, is spoken by tens of millions of people yet remains largely absent from the misinformation detection literature. We address this gap with DariMis, the first manually annotated dataset of 9,224 Dari-language YouTube videos, labeled across two dimensions: Information Type (Misinformation, Partly True, True) and Harm Level (Low, Medium, High). A central empirical finding is that these dimensions are structurally coupled, not independent: 55.9 percent of Misinformation carries at least Medium harm potential, compared with only 1.0 percent of True content. This enables Information Type classifiers to function as implicit harm-triage filters in content moderation pipelines. We further propose a pair-input encoding strategy that represents the video title and description as separate BERT segment inputs, explicitly modeling the semantic relationship…
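The pair-input idea can be illustrated with a small sketch of how a BERT-style sentence pair is laid out: the title goes in segment 0 and the description in segment 1, separated by `[SEP]`. This is a simplified illustration, not the authors' implementation. It assumes whitespace tokenization, and the function name `encode_pair`, the `max_len` parameter, and the English sample strings are all hypothetical. In practice, a Hugging Face tokenizer called with two text arguments produces the equivalent layout over subword tokens.

```python
def encode_pair(title, description, max_len=16):
    """Build a BERT-style pair input: [CLS] title [SEP] description [SEP].

    Returns (tokens, segment_ids), where segment 0 covers [CLS], the title,
    and the first [SEP], and segment 1 covers the description and final [SEP].
    Whitespace splitting stands in for a real subword tokenizer (assumption).
    """
    # Reserve three positions for the special tokens.
    title_tokens = title.split()[: max_len - 3]
    budget = max_len - 3 - len(title_tokens)
    desc_tokens = description.split()[: max(budget, 0)]

    tokens = ["[CLS]"] + title_tokens + ["[SEP]"] + desc_tokens + ["[SEP]"]
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(desc_tokens) + 1)
    return tokens, segment_ids


# Placeholder strings; real inputs would be Dari titles and descriptions.
tokens, segment_ids = encode_pair("breaking news", "watch the full video")
print(tokens)
print(segment_ids)
```

The contrast is with a single-segment input in which title and description are concatenated into one string. With separate segments, the encoder receives an explicit marker of which tokens belong to which field, which is the kind of signal the abstract describes as modeling the relationship between the two.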

Taxonomy

Topics: Misinformation and Its Impacts · Spam and Phishing Detection · Hate Speech and Cyberbullying Detection