Collinear datasets augmentation using Procrustes validation sets
Sergey Kucheryavskiy, Sergei Zhilin

TL;DR
This paper introduces a simple, fast augmentation method for numeric and mixed datasets with moderate to high collinearity. It is demonstrated on two real cases: prediction of protein in minced meat from near-infrared spectra, and discrimination of patients referred for coronary angiography.
Contribution
The paper presents a new augmentation technique that generates additional data points by combining cross-validation resampling with latent variable modeling. It exploits collinearity directly, has very few parameters, and applies to both numeric and mixed datasets.
Findings
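Reduced prediction error for protein in the minced meat dataset (near-infrared spectra, high collinearity).
Improved accuracy in discriminating patients referred for coronary angiography (mixed data, moderate collinearity).
The method is effective for datasets with moderate to high collinearity.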
Abstract
In this paper, we propose a new method for the augmentation of numeric and mixed datasets. The method generates additional data points by utilizing cross-validation resampling and latent variable modeling. It is particularly efficient for datasets with moderate to high degrees of collinearity, as it directly utilizes this property for generation. The method is simple, fast, and has very few parameters, which, as shown in the paper, do not require specific tuning. It has been tested on several real datasets; here, we report detailed results for two cases: prediction of protein in minced meat based on near-infrared spectra (fully numeric data with a high degree of collinearity) and discrimination of patients referred for coronary angiography (mixed data, with both numeric and categorical variables, and moderate collinearity). In both cases, artificial neural networks were employed for…
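To make the idea of combining cross-validation resampling with a latent variable model more concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it assumes PCA as the latent variable model, global mean centering, and an orthogonal Procrustes rotation between local and global loadings. The function names `pca_loadings` and `augment_sketch`, and the parameters `n_comp`, `n_segments`, and `seed`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def pca_loadings(Xc, n_comp):
    """Return the first n_comp PCA loadings (n_vars x n_comp) of centered data."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_comp].T


def augment_sketch(X, n_comp=2, n_segments=4, seed=0):
    """Generate one artificial copy of X (illustrative assumption, not the paper's method).

    Each cross-validation segment is projected onto a local PCA model fitted
    without that segment, and the result is mapped into the global model.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    mean = X.mean(axis=0)
    Xc = X - mean                      # global centering (a simplification)
    P = pca_loadings(Xc, n_comp)       # global loadings

    # Random cross-validation split of the row indices
    segments = np.array_split(rng.permutation(n), n_segments)

    X_new = np.empty_like(Xc)
    for idx in segments:
        keep = np.ones(n, dtype=bool)
        keep[idx] = False
        P_k = pca_loadings(Xc[keep], n_comp)   # local loadings without segment k

        # Orthogonal Procrustes: rotation R minimizing ||P_k R - P||_F
        U, _, Vt = np.linalg.svd(P_k.T @ P)
        R = U @ Vt

        X_k = Xc[idx]
        T_k = X_k @ P_k                # scores of held-out rows in the local model
        E_k = X_k - T_k @ P_k.T        # part of X_k outside the local subspace
        X_new[idx] = T_k @ R @ P.T + E_k

    return X_new + mean
```

Calling `augment_sketch` with different `seed` values produces different splits and therefore different artificial rows, which can then be stacked with the original data. The stronger the collinearity, the better a few components describe the data, which is consistent with the abstract's remark that the method relies on this property.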
Taxonomy
Topics: Spectroscopy and Chemometric Analyses · Meat and Animal Product Quality · Spectroscopy Techniques in Biomedical and Chemical Research
Methods: Sparse Evolutionary Training
