SOTorrent: Studying the Origin, Evolution, and Usage of Stack Overflow Code Snippets
Sebastian Baltes, Christoph Treude, Stephan Diehl

TL;DR
SOTorrent is a comprehensive dataset that tracks the detailed evolution of code snippets and text on Stack Overflow, enabling researchers to analyze how community-shared code changes over time and across platforms.
Contribution
The paper introduces SOTorrent, an open dataset that captures version histories of Stack Overflow posts at the code and text level, linking them to external platforms like GitHub.
Findings
Provides detailed version history of SO code snippets and posts.
Connects SO snippets to external platforms via URLs and references.
Enables analysis of code evolution and maintenance on SO.
Abstract
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Like other software artifacts, code on SO evolves over time, for example when bugs are fixed or APIs are updated to the most recent version. To be able to analyze how code and the surrounding text on SO evolve, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text and code blocks. It connects code snippets from SO posts to other platforms by aggregating URLs from surrounding text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution and maintenance of code on SO and its relation to other platforms…
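To make the idea of post-block granularity concrete, here is a minimal Python sketch of how one version of a post body could be split into text and code blocks, with URLs then collected from the text blocks. It is an illustration only and not SOTorrent's actual extraction pipeline. The function names `split_blocks` and `extract_urls`, the URL pattern, and the rule that a code line is any line indented by four spaces or a tab are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# Assumed, deliberately simple URL pattern (not the one used by SOTorrent).
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def split_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Split a post body into ("text" | "code", content) blocks.

    Assumption: a non-blank line indented by four spaces or a tab is code;
    every other non-blank line is text.
    """
    blocks: List[Tuple[str, List[str]]] = []
    for line in markdown.splitlines():
        if not line.strip():
            # Keep blank lines inside the current block; trailing ones
            # are trimmed when the block is joined below.
            if blocks:
                blocks[-1][1].append("")
            continue
        is_code = line.startswith("    ") or line.startswith("\t")
        kind = "code" if is_code else "text"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)
        else:
            blocks.append((kind, [line]))
    return [(kind, "\n".join(lines).rstrip()) for kind, lines in blocks]


def extract_urls(blocks: List[Tuple[str, str]]) -> List[str]:
    """Collect URLs that appear in text blocks only."""
    urls: List[str] = []
    for kind, content in blocks:
        if kind == "text":
            urls.extend(URL_PATTERN.findall(content))
    return urls


post_version = (
    "Use the join method, see https://example.com/docs for details.\n"
    "\n"
    "    result = ', '.join(items)\n"
    "    print(result)\n"
    "\n"
    "Updated for Python 3."
)

blocks = split_blocks(post_version)
urls = extract_urls(blocks)
```

Running the same split over successive versions of a post would give per-block snapshots that can then be compared, which is the kind of block-level history the abstract describes.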
