YT-30M: A multi-lingual multi-category dataset of YouTube comments
Hridoy Sankar Dutta

TL;DR
This paper presents large-scale multilingual datasets of YouTube comments, YT-30M and YT-100K, to facilitate research in comment analysis across multiple languages and categories.
Contribution
Introduces and publicly releases the YT-30M and YT-100K datasets, enabling large-scale multilingual comment analysis from YouTube videos.
Findings
YT-30M contains 32,236,173 comments and YT-100K contains 108,694, in multiple languages.
Each comment is labeled with the category of the associated YouTube channel (e.g., 'News & Politics').
Datasets support research in multilingual and multi-category comment analysis.
Abstract
This paper introduces two large-scale multilingual comment datasets from YouTube, YT-30M and YT-100K. The analysis in this paper is performed on YT-100K, a smaller sample of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted by YouTube channels that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
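The per-comment fields listed in the abstract can be pictured as a simple record type. The following is a minimal sketch in Python, not the dataset's actual schema: the class name `Comment`, the field names, the helper `comments_per_category`, and the sample rows are all assumptions made for illustration, and the released files may use different column names or formats.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Comment:
    # Field names are hypothetical; they mirror the eight attributes
    # named in the abstract, not the released files' real column names.
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    text: str
    upvotes: int
    original_channel_id: str
    category: str


def comments_per_category(comments: Iterable[Comment]) -> Counter:
    """Count comments by the category attached to each record."""
    return Counter(c.category for c in comments)


# Made-up records for illustration only; they are not drawn from the dataset.
sample = [
    Comment("v1", "c1", "alice", "ch_a", "Great explanation", 3, "ch_x", "Science & Technology"),
    Comment("v1", "c2", "bob", "ch_b", "Muy interesante", 0, "ch_x", "Science & Technology"),
    Comment("v2", "c3", "carol", "ch_c", "Thanks for the update", 1, "ch_y", "News & Politics"),
]

counts = comments_per_category(sample)
```

Grouping by `category` like this is one plausible starting point for the multi-category analysis the datasets are meant to support; a real pipeline would first map the released columns onto whatever record type it uses.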
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies
