The General Index of Software Engineering Papers
Zeinab Abou Khalil (DGD-I), Stefano Zacchiroli (IP Paris, LTCI)

TL;DR
This paper introduces a comprehensive, open dataset of indexed software engineering papers spanning 1971-2020, enabling meta-research and reproducibility in the field.
Contribution
It provides the first large-scale, full-text indexed dataset of software engineering papers, facilitating meta-analyses and independent verification of research outputs.
Findings
Contains 577 million unique n-grams from 44,581 papers
Enables meta-research and reproducibility in software engineering
Available as an open Postgres database dump
Abstract
We introduce the General Index of Software Engineering Papers, a dataset of full-text-indexed papers from the most prominent scientific venues in the field of Software Engineering. The dataset includes both complete bibliographic information and indexed n-grams (sequences of contiguous words after removal of stopwords and non-words, for a total of 577 276 382 unique n-grams in this release) with length 1 to 5 for 44 581 papers retrieved from 34 venues over the 1971-2020 period. The dataset serves use cases in the field of meta-research, allowing researchers to introspect the output of software engineering research even when access to papers or scholarly search engines is not possible (e.g., due to contractual reasons). The dataset also contributes to making such analyses reproducible and independently verifiable, as opposed to what happens when they are conducted using third-party and non-open scholarly…
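As an illustration of the n-gram definition given in the abstract, the following is a minimal sketch of extracting word n-grams of length 1 to 5 after dropping stopwords and non-words. It is not the authors' indexing pipeline: the stopword list, the tokenization rule, and the name `extract_ngrams` are assumptions made for this example only.

```python
import re

# Hypothetical, deliberately tiny stopword list; the dataset's actual
# list is not given in the abstract.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "and", "to", "is", "for"})


def extract_ngrams(text, max_n=5, stopwords=STOPWORDS):
    """Return the set of unique word n-grams of length 1..max_n.

    Assumption: a "word" is a purely alphabetic token; anything
    containing digits (e.g. "3rd") is treated as a non-word and dropped.
    """
    raw_tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in raw_tokens if t.isalpha() and t not in stopwords]

    ngrams = set()
    for n in range(1, max_n + 1):
        # Slide a window of n contiguous (post-filtering) tokens.
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    return ngrams


if __name__ == "__main__":
    sample = "The analysis of software engineering papers"
    for gram in sorted(extract_ngrams(sample, max_n=2)):
        print(gram)
```

Note that, in this sketch, contiguity is evaluated after filtering, so "analysis of software" yields the 2-gram "analysis software". Whether the dataset handles this the same way is not stated in the abstract.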
