YT-30M: A multi-lingual multi-category dataset of YouTube comments

Hridoy Sankar Dutta

arXiv:2412.03465 · cs.SI · December 5, 2024

TL;DR

This paper presents large-scale multilingual datasets of YouTube comments, YT-30M and YT-100K, to facilitate research in comment analysis across multiple languages and categories.

Contribution

Introduces and publicly releases the YT-30M and YT-100K datasets, enabling large-scale multilingual comment analysis from YouTube videos.

Findings

01 Datasets contain over 32 million comments in multiple languages.

02 Comments are categorized by YouTube channel categories.

03 Datasets support research in multilingual and multi-category comment analysis.

Abstract

This paper introduces two large-scale multilingual comment datasets from YouTube: YT-30M and YT-100K. The analysis in this paper is performed on YT-100K, a smaller sample of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample of YT-30M), are publicly released for further research. YT-30M contains 32,236,173 comments and YT-100K contains 108,694 comments, collected from YouTube channels that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
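The per-comment fields listed in the abstract can be pictured as a simple record type. The sketch below is only an illustration: the field names, the `Comment` class, the `comments_per_category` helper, and the sample rows are all assumptions made here, and the released files may use different column names and a different storage format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Comment:
    # Field names are illustrative guesses based on the abstract;
    # the released files may label these columns differently.
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    text: str
    upvotes: int
    original_channel_id: str
    category: str


def comments_per_category(comments: Iterable[Comment]) -> Counter:
    """Count how many comments fall under each channel category."""
    return Counter(c.category for c in comments)


# Hypothetical records, made up purely for illustration (not real data).
sample = [
    Comment("v1", "c1", "alice", "ch_a", "Great video", 3, "ch_x", "News & Politics"),
    Comment("v1", "c2", "bob", "ch_b", "Muy interesante", 0, "ch_x", "News & Politics"),
    Comment("v2", "c3", "carol", "ch_c", "Thanks!", 12, "ch_y", "Science & Technology"),
]
counts = comments_per_category(sample)
```

Grouping by `category` in this way is one plausible starting point for the multi-category analysis the datasets are meant to support; grouping by `original_channel_id` or `video_id` would follow the same pattern.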

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies