TL;DR
This paper introduces a semiparametric sparse canonical correlation analysis (CCA) method based on a Gaussian copula. It handles high-dimensional data with mixed variable types (continuous, binary, and zero-inflated) and is demonstrated on gene expression and microRNA data from breast cancer patients.
Contribution
It proposes a truncated latent Gaussian copula model for data with excess zeros, which yields a rank-based estimator of the latent correlation matrix for mixed variable types without estimating the marginal transformation functions.
Findings
Performs well in high-dimensional simulations
Applied to the association between gene expression and microRNA data from breast cancer patients
Outperforms existing methods in mixed data analysis
Abstract
Canonical correlation analysis investigates linear relationships between two sets of variables, but often works poorly on modern data sets due to high-dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach for sparse canonical correlation analysis based on Gaussian copula. Our main contribution is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without the estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings as demonstrated via numerical studies, as well as in application to the analysis of association between gene expression and micro RNA data of breast cancer patients.
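To make the idea of a rank-based estimator concrete, here is a minimal Python sketch for the simplest case only: two continuous variables under a Gaussian copula, where the latent correlation r and Kendall's tau satisfy the standard relation r = sin(pi/2 * tau). This is not the paper's bridge function for truncated (zero-inflated) or binary variables, which is more involved and is not reproduced here. The function name `latent_correlation_continuous`, the simulated data, and the chosen transformations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau


def latent_correlation_continuous(x, y):
    """Rank-based latent correlation for a continuous/continuous pair.

    Assumes a Gaussian copula, under which r = sin(pi/2 * tau).
    Hypothetical helper for illustration; it does NOT handle the
    truncated or binary cases treated in the paper.
    """
    tau, _ = kendalltau(x, y)
    return np.sin(np.pi / 2 * tau)


# Simulate latent Gaussian variables with known correlation (assumed value).
rng = np.random.default_rng(0)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

# Observed data: unknown strictly increasing transformations of the latents.
x = np.exp(z[:, 0])
y = z[:, 1] ** 3

# The estimator uses only ranks, so the transformations never need estimating.
r_hat = latent_correlation_continuous(x, y)
r_latent = latent_correlation_continuous(z[:, 0], z[:, 1])

# Pearson correlation on the observed scale, shown for comparison.
pearson = np.corrcoef(x, y)[0, 1]

print(f"rank-based estimate (observed data): {r_hat:.3f}")
print(f"rank-based estimate (latent data):   {r_latent:.3f}")
print(f"Pearson on observed data:            {pearson:.3f}")
```

Because Kendall's tau depends only on the ordering of the observations, the estimate computed from the transformed data coincides with the one computed from the latent Gaussian data. This is the sense in which marginal transformation functions do not need to be estimated; the paper extends this principle to variables with excess zeros.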
