CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

Vladislav Savenkov

arXiv:2605.12153 · cs.SE · May 13, 2026


TL;DR

CIDR is a large, proprietary industrial source code dataset from 12 organizations, designed to advance software engineering research with diverse, real-world codebases and comprehensive metadata.

Contribution

The paper introduces CIDR, a large-scale dataset of proprietary industrial code repositories with structured metadata, anonymization, and quality filtering. Public code datasets, which are drawn from open-source platforms, do not contain this kind of code.

Findings

01. Contains 2,440 repositories across 138 languages.

02. Includes 373 million lines of code from real-world industrial projects.

03. Supports research in code models, quality analysis, and developer behavior.

Abstract

We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic…
