WarCov -- Large multilabel and multimodal dataset from social platform

Weronika Borek-Marciniec; Pawel Zyblewski; Jakub Klikowski; Pawel Ksieniewicz

arXiv:2406.10255·cs.CL·June 18, 2024

TL;DR

This paper introduces WarCov, a large multimodal dataset of Polish social media posts about COVID-19 and the war in Ukraine. It includes both text and images and is designed for evaluating machine learning models in evolving NLP contexts.

Contribution

The paper presents a new, sizable multilingual dataset with multimodal data and hashtag-derived labels, and describes the dataset creation process and initial experiments.
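To make the idea of hashtag-derived labels concrete, here is a minimal sketch of turning a post's hashtags into a binary multilabel vector. It is an illustration only, not the authors' pipeline: the label vocabulary `LABEL_HASHTAGS`, the helper names, and the choice to strip hashtags from the model input are all assumptions introduced for this example.

```python
import re

# Hypothetical label vocabulary; the actual WarCov label set is not listed here.
LABEL_HASHTAGS = ["covid19", "szczepienia", "ukraina", "wojna"]

# Matches a '#' followed by word characters (Unicode-aware for str patterns).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the lowercased hashtags found in a post, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def hashtags_to_labels(text, vocabulary=LABEL_HASHTAGS):
    """Build a binary multilabel vector: 1 where the post uses that hashtag."""
    found = set(extract_hashtags(text))
    return [1 if tag in found else 0 for tag in vocabulary]


def strip_hashtags(text):
    """Remove hashtags from the text (an assumed step, so that the
    label does not appear verbatim in the model input)."""
    return " ".join(HASHTAG_RE.sub("", text).split())


post = "Nowe obostrzenia #COVID19 #Wojna"
labels = hashtags_to_labels(post)   # one entry per vocabulary hashtag
clean_text = strip_hashtags(post)   # text with the hashtags removed
```

Because a single post can carry several hashtags at once, the resulting vector can have more than one active entry, which is what makes the task multilabel and not multiclass.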

Findings

01. Dataset contains over 3 million posts with labels.

02. Includes both text and images for multimodal tasks.

03. Demonstrates dataset utility through initial pattern recognition experiments.

Abstract

In classification tasks, a series of steps, often associated with high costs, is necessary to get from raw data acquisition to a curated dataset suitable for evaluating machine learning models. In the case of Natural Language Processing, initial cleaning and conversion can be performed automatically, but obtaining labels still requires the rationalized input of human experts. As a result, even though many articles state that "the world is filled with data", data scientists suffer from its shortage. This is crucial in natural language applications, which are constantly evolving and must adapt to new concepts or events. For example, the topic of the COVID-19 pandemic and the vocabulary related to it would have been mostly unrecognizable before 2019. For this reason, creating new datasets, including in languages other than English, is still essential. This work…

Code & Models

Repositories

w4k2/warcow (PyTorch, official)

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Text and Document Classification Technologies