Scatter Matrix Concordance: A Diagnostic for Regressions on Subsets of Data
Michael J. Kane, Bryan Lewis, Sekhar Tatikonda, Simon Urbanek

TL;DR
This paper introduces a simple concordance measure to evaluate how well a subset of data captures the variance-covariance structure of the full dataset, aiding in efficient subset selection for regression models.
Contribution
It proposes a new concordance measure for design matrices and demonstrates its use in selecting data partitions that balance statistical accuracy and computational efficiency.
Findings
The concordance measure effectively assesses subset representativeness.
The measure supports a heuristic for choosing row partition sizes in large-scale regressions.
The method balances statistical fidelity with computational speed.
Abstract
Linear regression models depend directly on the design matrix and its properties. Techniques that efficiently estimate model coefficients by partitioning rows of the design matrix are increasingly popular for large-scale problems because they fit well with modern parallel computing architectures. We propose a simple measure of concordance between a design matrix and a subset of its rows that estimates how well a subset captures the variance-covariance structure of a larger data set. We illustrate the use of this measure in a heuristic method for selecting row partition sizes that balance statistical and computational efficiency goals in real-world problems.
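The abstract does not give the formula for the concordance measure, so the sketch below is only an illustration of the general idea, not the authors' definition. As a stand-in, it assumes a score equal to the cosine similarity between the flattened, row-count-scaled scatter matrices X^T X / n of the full design matrix and of a row subset. The names `scatter_matrix` and `concordance_sketch`, the synthetic data, and the choice of cosine similarity are all hypothetical.

```python
import math
import random


def scatter_matrix(rows):
    """Return the scaled scatter matrix X^T X / n for a list of equal-length rows."""
    n = len(rows)
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)] for i in range(p)]


def concordance_sketch(full_rows, subset_rows):
    """Cosine similarity between flattened scatter matrices.

    Illustrative stand-in only: this is NOT the measure defined in the paper.
    """
    a = [v for row in scatter_matrix(full_rows) for v in row]
    b = [v for row in scatter_matrix(subset_rows) for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Synthetic design matrix (assumption): intercept column plus two Gaussian columns.
rng = random.Random(0)
full = [[1.0, rng.gauss(0, 1), rng.gauss(0, 2)] for _ in range(2000)]

# A uniformly random subset versus one that keeps only the smallest values of column 3.
random_subset = rng.sample(full, 200)
biased_subset = sorted(full, key=lambda r: r[2])[:200]

score_full = concordance_sketch(full, full)
score_random = concordance_sketch(full, random_subset)
score_biased = concordance_sketch(full, biased_subset)
```

Under this stand-in, a subset whose second-moment structure matches the full data scores near 1, and a subset drawn from one tail of a column scores lower. A partition-size heuristic in the spirit of the abstract could then grow the subset until the score stops improving appreciably, although the paper's actual procedure may differ.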
