CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis
Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu

TL;DR
The paper introduces CC30k, a large dataset of citation contexts labeled with reproducibility-oriented sentiments, enabling better prediction and analysis of scientific reproducibility in machine learning literature.
Contribution
It provides a novel, large-scale dataset focused on reproducibility sentiments, filling a gap in resources for computational reproducibility research.
Findings
Fine-tuning on CC30k improves large language models' performance on reproducibility-oriented sentiment classification.
The dataset achieves 94% labeling accuracy.
CC30k enables large-scale reproducibility assessment of ML papers.
Abstract
Sentiments about the reproducibility of cited papers in downstream literature offer community perspectives and have shown promise as a signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments and to systematically study their correlation with reproducibility, we introduce the CC30k dataset, comprising a total of 30,734 citation contexts in machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on…
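As a rough illustration of the structure the abstract describes, the sketch below models one labeled citation context and works out the split implied by the stated counts. The class names, the `source` field, and the `label_distribution` helper are hypothetical and are not taken from the CC30k release; the reading that the 4,905 non-crowdsourced contexts are the pipeline-generated negatives is an inference from the abstract, not a confirmed figure.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ReproSentiment(Enum):
    """The three reproducibility-oriented labels named in the abstract."""
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"


@dataclass(frozen=True)
class CitationContext:
    """One labeled citation context (hypothetical schema, not the dataset's)."""
    text: str
    label: ReproSentiment
    source: str  # assumed values: "crowdsourced" or "generated"


# Counts stated in the abstract.
TOTAL_CONTEXTS = 30_734
CROWDSOURCED = 25_829

# Remainder; assumed here to be the pipeline-generated negatives.
REMAINING = TOTAL_CONTEXTS - CROWDSOURCED


def label_distribution(records):
    """Count records per label, reporting zero for labels that never occur."""
    counts = Counter(record.label for record in records)
    return {label: counts.get(label, 0) for label in ReproSentiment}
```

Reporting a zero for absent labels keeps the scarcity of Negative examples visible, which is the imbalance the generated negatives are meant to offset.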
Taxonomy
Topics: Meta-analysis and systematic reviews · Biomedical Text Mining and Ontologies · Scientific Computing and Data Management
