Combining datasets to increase the number of samples and improve model fitting

Thu Nguyen; Rabindra Khadka; Nhan Phan; Anis Yazidi; Pål Halvorsen; Michael A. Riegler

arXiv:2210.05165 · stat.ML · May 17, 2023 · 5 cites

TL;DR

This paper introduces ComImp, a novel framework for combining datasets with different features to enhance machine learning model performance, especially useful for small datasets, by imputing missing data and reducing dimensionality.

Contribution

The paper proposes ComImp and PCA-ComImp frameworks that effectively combine datasets with non-overlapping features and missing data, improving model accuracy and enabling better transfer learning.

Findings

01 Significant accuracy improvements when combining small datasets.

02 Effective data imputation and dimensionality reduction techniques.

03 Enhanced transfer learning performance with combined datasets.

Abstract

For many use cases, combining information from different datasets can improve a machine learning model's performance, especially when the number of samples in at least one of the datasets is small. However, a potential challenge in such cases is that the features of these datasets are not identical, even though the datasets share some features. To tackle this challenge, we propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), called PCA-ComImp, to reduce dimension before combining datasets. This is useful when the datasets have a large number of features that are not shared between them. Furthermore, our framework can also be utilized for data preprocessing by imputing missing data, i.e., filling in the missing entries while…
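The combining idea in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the description: rows from all datasets are stacked over the union of their features, and the entries for features a dataset lacks are then imputed. The function name `combine_by_imputation`, the toy feature names, and the use of column-mean imputation are illustrative assumptions, not the authors' implementation; the abstract does not say which imputer is used, and the PCA step of PCA-ComImp is not shown.

```python
from statistics import mean


def combine_by_imputation(datasets):
    """Stack rows over the union of features, then mean-impute the gaps.

    `datasets` is a list of datasets; each dataset is a list of dicts
    mapping feature name -> numeric value. Mean imputation is only a
    placeholder for whatever imputer one actually prefers.
    """
    # Union of feature names, preserving first-seen order.
    features = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in features:
                    features.append(name)

    # Stack all rows; a feature absent from a row becomes None (missing).
    stacked = [
        {name: row.get(name) for name in features}
        for rows in datasets
        for row in rows
    ]

    # Column means computed over observed (non-missing) values only.
    col_means = {
        name: mean(r[name] for r in stacked if r[name] is not None)
        for name in features
    }

    # Fill each missing entry with its column mean.
    return [
        {
            name: r[name] if r[name] is not None else col_means[name]
            for name in features
        }
        for r in stacked
    ]


# Hypothetical toy data: the two datasets share only the "age" feature.
small = [{"age": 30, "bmi": 22.0}, {"age": 50, "bmi": 28.0}]
large = [
    {"age": 40, "glucose": 5.0},
    {"age": 60, "glucose": 7.0},
    {"age": 20, "glucose": 6.0},
]
combined = combine_by_imputation([small, large])
```

In this toy case `combined` has five rows, each with `age`, `bmi`, and `glucose`, so a model trained on it sees more samples than either dataset offers alone. Whether that helps in practice depends on how well the imputed values reflect the unobserved ones.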

Code & Models

Repositories

thunguyen177/comimp (official)

Taxonomy

Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications