SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts
Sebastian Baltes, Lorik Dumani, Christoph Treude, Stephan Diehl

TL;DR
SOTorrent is an open dataset that captures the detailed version history of Stack Overflow posts, enabling analysis of how questions and answers evolve over time, including code and text edits, and their connections to other platforms.
Contribution
This paper introduces SOTorrent, a novel dataset that reconstructs Stack Overflow post histories at fine granularity and evaluates string similarity metrics for accurate version reconstruction.
Findings
Post edits are typically small and occur shortly after creation.
Code snippets are rarely changed without editing surrounding text.
There is a strong correlation between post edits and comments.
Abstract
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated…
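To make the idea of block-level version history concrete, the following is a minimal sketch of how a post body could be split into text and code blocks and how blocks in two consecutive versions could be paired by string similarity. It is not the SOTorrent implementation. The function names `split_blocks` and `match_blocks`, the rule that a line indented by four spaces or a tab is code, the greedy pairing strategy, and the 0.6 threshold are all assumptions made for illustration, and Python's `difflib.SequenceMatcher` stands in for whichever similarity metric the paper's evaluation selected.

```python
from difflib import SequenceMatcher


def split_blocks(markdown):
    """Split a post body into (kind, content) blocks, kind being 'code' or 'text'.

    Assumption for this sketch: a line indented by four spaces or a tab is code,
    every other non-blank line is text.
    """
    blocks = []
    for line in markdown.splitlines():
        if not line.strip():
            continue  # blank lines neither start nor extend a block here
        kind = "code" if line.startswith(("    ", "\t")) else "text"
        if blocks and blocks[-1][0] == kind:
            # Same kind as the previous line: extend the current block.
            blocks[-1] = (kind, blocks[-1][1] + "\n" + line)
        else:
            blocks.append((kind, line))
    return blocks


def match_blocks(old_blocks, new_blocks, threshold=0.6):
    """Greedily pair each new block with its most similar unused old block.

    Only blocks of the same kind are compared. Returns a list of
    (old_index or None, new_index, similarity) tuples; None marks a block
    with no sufficiently similar predecessor.
    """
    used = set()
    matches = []
    for j, (kind, content) in enumerate(new_blocks):
        best_i, best_sim = None, 0.0
        for i, (old_kind, old_content) in enumerate(old_blocks):
            if i in used or old_kind != kind:
                continue
            sim = SequenceMatcher(None, old_content, content).ratio()
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None and best_sim >= threshold:
            used.add(best_i)
            matches.append((best_i, j, best_sim))
        else:
            matches.append((None, j, 0.0))
    return matches


if __name__ == "__main__":
    # Hypothetical two-version history of a single post.
    v1 = "Use a loop:\n\n    for i in range(3):\n        print(i)\n"
    v2 = "Use a loop:\n\n    for i in range(5):\n        print(i)\n"
    for old_i, new_i, sim in match_blocks(split_blocks(v1), split_blocks(v2)):
        print(old_i, new_i, round(sim, 3))
```

A greedy, threshold-based pairing keeps the sketch short, but it can mismatch blocks when a post contains several near-identical snippets. Choosing a metric and threshold that handle such cases is the kind of question a similarity-metric evaluation would address.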
