Combining datasets to increase the number of samples and improve model fitting
Thu Nguyen, Rabindra Khadka, Nhan Phan, Anis Yazidi, Pål Halvorsen, Michael A. Riegler

TL;DR
This paper introduces ComImp, a novel framework for combining datasets with different features to enhance machine learning model performance, especially useful for small datasets, by imputing missing data and reducing dimensionality.
Contribution
The paper proposes the ComImp and PCA-ComImp frameworks, which combine datasets whose feature sets only partially overlap and which may contain missing data, improving model accuracy and enabling better transfer learning.
Findings
Significant accuracy improvements when combining small datasets.
Effective data imputation and dimensionality reduction techniques.
Enhanced transfer learning performance with combined datasets.
Abstract
For many use cases, combining information from different datasets can be of interest to improve a machine learning model's performance, especially when the number of samples from at least one of the datasets is small. However, a potential challenge in such cases is that the features from these datasets are not identical, even though there are some commonly shared features among the datasets. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimension before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can also be utilized for data preprocessing by imputing missing data, i.e., filling in the missing entries while…
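The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: stack the datasets on the union of their features, then fill the resulting gaps by imputation. The function name `combine_by_imputation`, the toy feature names and values, and the use of column-mean imputation are all assumptions made for illustration; the paper's imputer and the PCA step of PCA-ComImp are not reproduced here.

```python
import numpy as np
import pandas as pd


def combine_by_imputation(datasets):
    """Stack datasets on the union of their columns, then mean-impute the gaps.

    Hypothetical helper: mean imputation is a stand-in for whatever
    imputer one would actually choose.
    """
    # pd.concat aligns rows by column name; a feature that a dataset
    # lacks becomes NaN for all of that dataset's rows.
    stacked = pd.concat(datasets, axis=0, ignore_index=True, sort=False)
    # Fill each missing entry with the mean of the observed values in its column.
    return stacked.fillna(stacked.mean())


# Two toy datasets that share only the feature "age" (made-up values).
d1 = pd.DataFrame({"age": [30.0, 40.0], "bmi": [22.0, 26.0]})
d2 = pd.DataFrame({"age": [50.0, 60.0, 70.0], "glucose": [5.0, 6.0, 7.0]})

combined = combine_by_imputation([d1, d2])
print(combined)
```

The combined table has five rows and three columns, so a model trained on it sees more samples than either dataset offers alone. A more expressive imputer could be swapped in for the `fillna` call without changing the stacking step.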
Taxonomy
Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications
