TL;DR
This paper presents a large, curated dataset of agentic AI coding tool configurations from open-source repositories, enabling research on AI tool usage, context engineering, and human-AI collaboration.
Contribution
It introduces a dataset of 15,591 configuration artifacts and 148,519 AI-co-authored commits across five agentic AI coding tools. No curated dataset previously captured these configurations at scale.
Findings
The dataset covers 4,738 repositories across five AI tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms.
Includes full content of 18,167 configuration files.
Supports research on AI tool adoption and collaboration patterns.
Abstract
Agentic AI coding tools such as Claude Code and OpenAI Codex execute multi-step coding tasks with limited human oversight. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) for configuration mechanisms such as Context Files, Skills, Rules, and Hooks. There is no curated dataset yet that captures these configurations at scale. This dataset, collected from open-source GitHub repositories, fills that gap. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 as belonging to engineered software projects, and systematically detected configuration artifacts in these repositories. The dataset covers 4,738 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. We collected 15,591 configuration…
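The artifact-detection step described in the abstract can be pictured as matching repository file paths against per-tool patterns. The sketch below is illustrative only: the `PATTERNS` table and the `detect_artifacts` helper are hypothetical, the listed paths are commonly used conventions for these tools rather than the paper's actual detection rules, and only root-level paths are considered.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns per tool (assumption, not the paper's rule set).
# Note: in fnmatch-style patterns, "*" also matches "/" so ".claude/*"
# covers nested files such as ".claude/skills/x/SKILL.md".
PATTERNS = {
    "Claude Code": ["CLAUDE.md", ".claude/*"],
    "GitHub Copilot": [".github/copilot-instructions.md"],
    "OpenAI Codex": ["AGENTS.md"],
    "Cursor": [".cursorrules", ".cursor/rules/*"],
    "Gemini": ["GEMINI.md"],
}


def detect_artifacts(paths):
    """Map each tool to the repository-relative paths that match its patterns."""
    found = {}
    for path in paths:
        for tool, patterns in PATTERNS.items():
            # Case-sensitive match so results do not depend on the host OS.
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                found.setdefault(tool, []).append(path)
    return found


if __name__ == "__main__":
    repo_files = ["CLAUDE.md", "src/main.py", ".cursor/rules/style.mdc", "AGENTS.md"]
    for tool, hits in detect_artifacts(repo_files).items():
        print(tool, hits)
```

A real pipeline would likely also need to handle nested context files, distinguish configuration mechanisms (Context Files, Skills, Rules, Hooks), and read file contents; none of that is attempted here.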
