Statistical Learning Theory in Lean 4: Empirical Processes from Scratch

Yuanhe Zhang; Jason D. Lee; Fanghui Liu

arXiv:2602.02285·cs.LG·February 3, 2026

Statistical Learning Theory in Lean 4: Empirical Processes from Scratch

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

PDF

Open Access 3 Datasets

TL;DR

This paper develops a comprehensive formalization of statistical learning theory in Lean 4, including new formal proofs of key theorems and applications to regression, enhancing rigor and reproducibility in ML theory.

Contribution

It provides the first complete Lean 4 formalization of SLT, including Gaussian concentration, Dudley's entropy theorem, and regression applications, filling gaps in existing libraries.

Findings

01

Formalization exposes implicit assumptions in SLT textbooks.

02

Provides a reusable Lean 4 toolbox for statistical learning theory.

03

Achieves a granular, verified understanding of key SLT results.

Abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGaussian Processes and Bayesian Inference · Statistical Mechanics and Entropy · Adversarial Robustness in Machine Learning