WarCov -- Large multilabel and multimodal dataset from social platform
Weronika Borek-Marciniec, Pawel Zyblewski, Jakub Klikowski, Pawel Ksieniewicz

TL;DR
This paper introduces WarCov, a large multimodal dataset of Polish social media posts about COVID-19 and the war in Ukraine, including texts and images, designed for evaluating machine learning models in evolving NLP contexts.
Contribution
It presents a new, sizable multilabel dataset of Polish-language posts with multimodal data and labels derived from hashtags, along with the process of dataset creation and initial experiments.
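To illustrate what "labels derived from hashtags" can look like in practice, here is a minimal sketch. It is not the authors' pipeline: the topic names, the hashtag sets in `TOPIC_HASHTAGS`, and both helper functions are hypothetical, and removing hashtags from the text is an assumption made here so that the label does not leak into the input.

```python
import re

# Hypothetical mapping from topic label to hashtags; the real WarCov
# label set and hashtag lists are not given in this summary.
TOPIC_HASHTAGS = {
    "covid": {"covid19", "koronawirus", "szczepienia"},
    "war": {"ukraina", "wojna"},
}

HASHTAG_RE = re.compile(r"#(\w+)")


def labels_from_hashtags(text):
    """Return every topic whose hashtag set intersects the post's hashtags."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return sorted(topic for topic, known in TOPIC_HASHTAGS.items() if tags & known)


def strip_hashtags(text):
    """Remove hashtags from the text (an assumption, to avoid label leakage)."""
    return " ".join(HASHTAG_RE.sub("", text).split())


post = "Nowe obostrzenia #COVID19 #Ukraina"
labels = labels_from_hashtags(post)   # a post may receive several labels
clean_text = strip_hashtags(post)
```

Because one post can carry hashtags from several topics, the result is a list of labels rather than a single class, which is what makes the task multilabel.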
Findings
Dataset contains over 3 million posts with labels.
Includes both text and images for multimodal tasks.
Demonstrates dataset utility through initial pattern recognition experiments.
Abstract
In classification tasks, a series of steps, often associated with high costs, is necessary to get from raw data acquisition to a curated dataset suitable for evaluating machine learning models. In the case of Natural Language Processing, initial cleaning and conversion can be performed automatically, but obtaining labels still requires the rationalized input of human experts. As a result, even though many articles state that "the world is filled with data", data scientists suffer from a shortage of it. This is especially critical for natural language applications, which are constantly evolving and must adapt to new concepts or events. For example, the topic of the COVID-19 pandemic and the vocabulary related to it would have been mostly unrecognizable before 2019. For this reason, creating new datasets, including in languages other than English, remains essential. This work…
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Text and Document Classification Technologies
