CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
Vladislav Savenkov

TL;DR
CIDR is a large, proprietary industrial source code dataset from 12 organizations, designed to advance software engineering research with diverse, real-world codebases and comprehensive metadata.
Contribution
The paper introduces CIDR, a large-scale dataset of proprietary industrial code repositories, a kind of code that public datasets do not contain, prepared with structured metadata, anonymization, and quality filtering.
Findings
Contains 2,440 repositories across 138 languages.
Includes 373 million lines of code from real-world industrial projects.
Supports research in code models, quality analysis, and developer behavior.
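For a rough sense of scale, the headline figures above can be turned into per-repository and per-partner averages. This is only back-of-the-envelope arithmetic on the rounded numbers quoted on this page; the variable names are illustrative, and the true distribution across repositories and partners is not given here and is likely uneven.

```python
# Headline figures as quoted on this page.
repositories = 2_440
lines_of_code = 373_000_000  # "373 million" is a rounded figure
partners = 12

# Simple means; they say nothing about how sizes are actually distributed.
mean_lines_per_repo = lines_of_code / repositories
mean_repos_per_partner = repositories / partners

print(round(mean_lines_per_repo))     # 152869
print(round(mean_repos_per_partner))  # 203
```

That is, on average about 153,000 lines of code per repository and about 203 repositories per partner organization.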
Abstract
We present the Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data-sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic…
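The two-stage quality selection mentioned in the abstract (automated metadata filtering followed by manual code review) could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the `RepoMeta` fields, the thresholds, the repository names, and the shape of the reviewer verdicts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RepoMeta:
    # Hypothetical metadata fields; the paper's actual schema is not shown here.
    name: str
    lines_of_code: int
    commit_count: int


# Hypothetical thresholds; the abstract does not state the real criteria.
MIN_LINES = 1_000
MIN_COMMITS = 10


def passes_metadata_filter(repo: RepoMeta) -> bool:
    """Stage 1: cheap automated check on repository metadata."""
    return repo.lines_of_code >= MIN_LINES and repo.commit_count >= MIN_COMMITS


def select(repos: list[RepoMeta], manual_verdicts: dict[str, bool]) -> list[RepoMeta]:
    """Run stage 1, then keep only repositories a human reviewer approved (stage 2)."""
    candidates = [r for r in repos if passes_metadata_filter(r)]
    # A repository with no recorded verdict is treated as not approved.
    return [r for r in candidates if manual_verdicts.get(r.name, False)]


repos = [
    RepoMeta("billing-api", lines_of_code=120_000, commit_count=900),
    RepoMeta("demo-stub", lines_of_code=200, commit_count=3),
    RepoMeta("legacy-portal", lines_of_code=45_000, commit_count=400),
]
manual_verdicts = {"billing-api": True, "legacy-portal": False}

selected = select(repos, manual_verdicts)
print([r.name for r in selected])
```

Running the automated filter first keeps the manual review queue small: only repositories that clear the metadata thresholds ever reach a human reviewer.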
