PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts
Ana-Cristina Rogoz, Maria Ilinca Nechita, Radu Tudor Ionescu

TL;DR
PoPreRo is a newly introduced Romanian Reddit dataset for predicting post popularity, providing a challenging benchmark with baseline models and encouraging future research in this niche.
Contribution
The paper presents the first Romanian Reddit dataset for popularity prediction and establishes baseline models, highlighting the task's difficulty and potential for future research.
Findings
Top model achieves 61.35% accuracy and a 60.60% macro F1 score on the test set
Popularity prediction is highly challenging
Few-shot prompting with Falcon-7B confirms difficulty
Abstract
We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.
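The abstract reports two metrics, accuracy and macro F1. As a reminder of how they differ, the following is a minimal sketch in plain Python. It is not the authors' evaluation code. The function names and the toy label lists are hypothetical, and the 0/1 labels are only an illustrative assumption about how popularity classes might be encoded.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only (not taken from PoPreRo).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro F1 averages the per-class scores without weighting by class frequency, it drops below accuracy when a model does poorly on the less frequent class, which is why reporting both numbers is informative.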
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Computational and Text Analysis Methods
Methods: Sparse Evolutionary Training
