Scatter Matrix Concordance: A Diagnostic for Regressions on Subsets of Data

Michael J. Kane; Bryan Lewis; Sekhar Tatikonda; Simon Urbanek

arXiv:1507.03285·stat.ML·July 23, 2019

TL;DR

This paper introduces a simple concordance measure to evaluate how well a subset of data captures the variance-covariance structure of the full dataset, aiding in efficient subset selection for regression models.

Contribution

It proposes a new concordance measure for design matrices and demonstrates its use in selecting data partitions that balance statistical accuracy and computational efficiency.

Findings

01

The concordance measure effectively assesses subset representativeness.

02

Using the measure improves subset selection for large-scale regressions.

03

The method balances statistical fidelity with computational speed.

Abstract

Linear regression models depend directly on the design matrix and its properties. Techniques that efficiently estimate model coefficients by partitioning rows of the design matrix are increasingly popular for large-scale problems because they fit well with modern parallel computing architectures. We propose a simple measure of concordance between a design matrix and a subset of its rows that estimates how well a subset captures the variance-covariance structure of a larger data set. We illustrate the use of this measure in a heuristic method for selecting row partition sizes that balance statistical and computational efficiency goals in real-world problems.
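
The abstract does not state the formula for the concordance measure, so the sketch below should not be read as the authors' definition. It uses a stand-in to illustrate the general idea of comparing a subset's scatter matrix to that of the full design matrix: the cosine similarity between the two scatter matrices X^T X under the Frobenius inner product. The function names (scatter_matrix, frobenius_inner, scatter_similarity) and the simulated data are hypothetical and chosen only for this example.

```python
import math
import random


def scatter_matrix(rows):
    """Return X^T X for a design matrix given as a list of equal-length rows."""
    p = len(rows[0])
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            for j in range(p):
                s[i][j] += row[i] * row[j]
    return s


def frobenius_inner(a, b):
    """Sum of elementwise products of two matrices of the same shape."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def scatter_similarity(full_rows, subset_rows):
    """Illustrative stand-in, NOT the paper's measure: cosine similarity
    between the scatter matrices of the full data and of a row subset.
    A value of 1.0 means the two matrices agree up to a scale factor."""
    s_full = scatter_matrix(full_rows)
    s_sub = scatter_matrix(subset_rows)
    denom = math.sqrt(
        frobenius_inner(s_full, s_full) * frobenius_inner(s_sub, s_sub)
    )
    return frobenius_inner(s_full, s_sub) / denom


# Simulated design matrix (assumption): an intercept column plus two
# independent standard normal predictors.
rng = random.Random(0)
X = [[1.0, rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]

# Score random row subsets of increasing size against the full matrix.
for size in (10, 100, 1000):
    subset = rng.sample(X, size)
    print(size, round(scatter_similarity(X, subset), 4))
```

Under this stand-in, one way to mimic a partition-size heuristic would be to pick the smallest subset size whose score exceeds a chosen threshold, trading a small loss in fidelity for less computation per partition. The threshold and the selection rule are assumptions of this example, not details taken from the paper.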
