The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories
Melanie Warrick, Samuel F. Rosenblatt, Jean-Gabriel Young, Amanda Casari, Laurent Hébert-Dufresne, James Bagrow

TL;DR
This paper presents a standardized, large-scale dataset of mailing list communications for open source projects, combining social and technical data to facilitate sociotechnical analysis and theory testing.
Contribution
It introduces a comprehensive mailing list dataset for multiple open source communities, merging social and technical data, and demonstrates its utility through analysis of the Python community.
Findings
33% of GitHub contributors identified in mailing lists
Correlation between social messaging valence and collaboration network structure
Dataset enables testing organizational science theories in open source
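The first finding is, at its core, an overlap computation between two sets of identities: people seen in the repository and people seen on the mailing lists. The Python sketch below illustrates one way such a share could be computed, assuming identities are matched on normalized email address. That matching rule, the function names, and the toy addresses are all assumptions made for illustration; this summary does not describe the matching procedure the authors actually used.

```python
from email.utils import parseaddr


def normalize(header_value: str) -> str:
    """Return the lowercased bare address from a From-style string, or ''."""
    _, address = parseaddr(header_value)
    return address.strip().lower()


def share_of_contributors_on_lists(contributor_emails, from_headers):
    """Fraction of repository contributors whose address also appears
    as a mailing list sender (hypothetical matching rule: exact match
    on normalized email address)."""
    senders = {normalize(h) for h in from_headers}
    senders.discard("")
    contributors = {normalize(e) for e in contributor_emails}
    contributors.discard("")
    if not contributors:
        return 0.0
    return len(contributors & senders) / len(contributors)


# Toy data, invented for this example only.
contributors = ["ada@example.org", "bob@example.com", "eve@example.net"]
headers = ["Ada Lovelace <Ada@Example.org>", "carol@example.com"]
share = share_of_contributors_on_lists(contributors, headers)
```

Exact address matching is likely to undercount, since people often use different addresses for commits and for list traffic; any real analysis would need a more careful identity-resolution step.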
Abstract
Communication surrounding the development of an open source project largely occurs outside the software repository itself. Historically, large communities often used a collection of mailing lists to discuss the different aspects of their projects. Multimodal tool use, with software development and communication happening on different channels, complicates the study of open source projects as a sociotechnical system. Here, we combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present. We share all scraping and cleaning code to facilitate reproduction of this work, as well as smaller datasets for the Golang (122,721 messages), Angular (20,041 messages) and Node.js (12,514 messages) communities. To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer (which GitHub account works on…
Taxonomy
Topics: Complex Network Analysis Techniques · Open Source Software Innovations · Wikis in Education and Collaboration
