SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks

Michael Shliselberg; Ashkan Kazemi; Scott A. Hale; Shiri Dori-Hacohen

arXiv:2405.10700·cs.IR·May 20, 2024

TL;DR

SynDy is a novel framework that uses large language models to generate synthetic, labeled datasets for misinformation detection tasks, significantly reducing the need for costly human annotation and aiding fact-checking efforts in underserved communities.

Contribution

To the authors' knowledge, SynDy is the first work to use LLMs to create fine-grained synthetic labels for misinformation-related tasks, increasing the data available for training specialized models.

Findings

01 Training on SynDy data improves baseline performance.

02 Synthetic labels are comparable to human labels in effectiveness.

03 Framework is integrated into real-world fact-checking tools.

Abstract

Diaspora communities are disproportionately impacted by off-the-radar misinformation and are often neglected by mainstream fact-checking efforts, creating a critical need to scale up the efforts of nascent fact-checking initiatives. In this paper we present SynDy, a framework for Synthetic Dynamic Dataset Generation that leverages the capabilities of the largest frontier Large Language Models (LLMs) to train local, specialized language models. To the best of our knowledge, this is the first paper to use LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy uses LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up…
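As a rough illustration of the idea in the abstract, the sketch below builds a tiny synthetically labeled dataset for one of the three tasks, Claim Matching. It is not the authors' implementation: the names `LabeledPair`, `toy_labeler`, and `build_claim_matching_dataset` are hypothetical, the example posts are invented, and `toy_labeler` is a word-overlap heuristic standing in for the LLM call that would assign labels in a real pipeline.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class LabeledPair:
    """One training example: two posts plus a synthetic label."""
    text_a: str
    text_b: str
    label: str  # "match" or "no_match"


def toy_labeler(text_a: str, text_b: str) -> str:
    """Hypothetical stand-in for an LLM labeling call.

    Labels a pair as "match" when the Jaccard overlap of their
    word sets is at least 0.5. A real pipeline would prompt an LLM here.
    """
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    overlap = len(words_a & words_b) / max(1, len(words_a | words_b))
    return "match" if overlap >= 0.5 else "no_match"


def build_claim_matching_dataset(
    posts: List[str], labeler: Callable[[str, str], str]
) -> List[LabeledPair]:
    """Pair up every two posts and attach a synthetic label to each pair."""
    return [LabeledPair(a, b, labeler(a, b)) for a, b in combinations(posts, 2)]


# Invented posts standing in for the results of a social media query.
posts = [
    "the vaccine contains a microchip",
    "the vaccine contains a tracking microchip",
    "the election results were delayed by weather",
]

dataset = build_claim_matching_dataset(posts, toy_labeler)
for pair in dataset:
    print(pair.label, "|", pair.text_a, "|", pair.text_b)
```

Passing the labeler in as a function keeps the dataset-building step independent of how labels are produced, so the heuristic could be swapped for a model-backed labeler without changing the rest of the sketch. The resulting pairs could then serve as training data for a smaller, local model.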
