SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks
Michael Shliselberg, Ashkan Kazemi, Scott A. Hale, Shiri Dori-Hacohen

TL;DR
SynDy is a novel framework that uses large language models to generate synthetic, labeled datasets for misinformation detection tasks, significantly reducing the need for costly human annotation and aiding fact-checking efforts in underserved communities.
Contribution
SynDy is, to the authors' knowledge, the first work to use LLMs to create fine-grained synthetic labels for misinformation-related tasks, increasing the data available for training specialized models.
Findings
Training on SynDy data improves baseline performance.
Synthetic labels are comparable to human labels in effectiveness.
Framework is integrated into real-world fact-checking tools.
Abstract
Diaspora communities are disproportionately impacted by off-the-radar misinformation and often neglected by mainstream fact-checking efforts, creating a critical need to scale up the efforts of nascent fact-checking initiatives. In this paper, we present SynDy, a framework for Synthetic Dynamic Dataset Generation that leverages the capabilities of the largest frontier Large Language Models (LLMs) to train local, specialized language models. To the best of our knowledge, SynDy is the first paper utilizing LLMs to create fine-grained synthetic labels for tasks of direct relevance to misinformation mitigation, namely Claim Matching, Topical Clustering, and Claim Relationship Classification. SynDy utilizes LLMs and social media queries to automatically generate distantly-supervised, topically-focused datasets with synthetic labels on these three tasks, providing essential tools to scale up…
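The general idea of synthetic labeling for Claim Matching can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LabeledPair`, `build_claim_matching_dataset`, and `toy_labeler` are hypothetical, and the token-overlap heuristic is only a stand-in for the LLM call that a real pipeline would make. The sketch assumes posts have already been retrieved by a topical social media query.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence

VALID_LABELS = {"match", "no_match"}


@dataclass(frozen=True)
class LabeledPair:
    """One synthetic training example for claim matching (hypothetical schema)."""
    post_a: str
    post_b: str
    label: str


def build_claim_matching_dataset(
    posts: Sequence[str],
    labeler: Callable[[str, str], str],
) -> List[LabeledPair]:
    """Label every unordered pair of posts with the supplied labeler.

    In a real pipeline `labeler` would wrap an LLM prompt; here it is
    any callable that returns "match" or "no_match".
    """
    dataset = []
    for a, b in combinations(posts, 2):
        label = labeler(a, b)
        if label not in VALID_LABELS:
            continue  # drop malformed labeler output instead of guessing
        dataset.append(LabeledPair(a, b, label))
    return dataset


def toy_labeler(a: str, b: str) -> str:
    """Stand-in for an LLM: Jaccard token overlap with a 0.5 threshold (assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    overlap = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
    return "match" if overlap >= 0.5 else "no_match"


posts = [
    "the vaccine alters your dna",
    "the vaccine alters your dna permanently",
    "city bans all cars downtown",
]
dataset = build_claim_matching_dataset(posts, toy_labeler)
for pair in dataset:
    print(pair.label, "|", pair.post_a, "|", pair.post_b)
```

The resulting pairs could then serve as training data for a smaller, local model; swapping `toy_labeler` for a function that queries an LLM is the only change the sketch would need to approximate the distant-supervision setup described in the abstract.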
