Data-fusion using factor analysis and low-rank matrix completion
Daniel Ahfock, Saumyadipta Pyne, Geoffrey J. McLachlan

TL;DR
This paper combines factor analysis with low-rank matrix completion for data-fusion in the statistical file-matching problem, providing identifiability guarantees and demonstrating practical advantages on real datasets.
Contribution
The paper proves the identifiability of the factor analysis model in the file-matching problem and develops an EM algorithm for covariance estimation from incomplete data.
Findings
The factor analysis-based method achieves lower reconstruction error than traditional low-rank matrix completion.
Theoretical conditions for model identifiability are established.
Empirical results on real datasets validate the approach's effectiveness.
Abstract
Data-fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data-fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only strict subsets of marginal distributions have been observed. Estimation of the covariance matrix of the full set of variables is challenging given the missing-data pattern. Factor analysis models use lower-dimensional latent variables in the data-generating process, and this introduces low-rank components in the complete-data matrix and the population covariance matrix. The low-rank structure of the factor analysis model can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove the identifiability of the factor analysis model in the statistical file-matching problem under…
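The low-rank idea in the abstract can be illustrated at the population level with a minimal sketch. This is not the paper's EM algorithm or its identifiability condition; it is a toy calculation under assumptions introduced here: a hypothetical layout in which shared variables X (split into two groups X1 and X2) appear in both files, Y appears only in file A, and Z appears only in file B, with each of X1 and X2 containing at least q variables. Under a factor model Sigma = Lambda Lambda^T + Psi with diagonal Psi, every off-diagonal block of Sigma has rank at most q, so the never-observed block Sigma_YZ can be completed from observed off-diagonal blocks. All dimensions and variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions for this sketch): q latent factors,
# shared variables split into X1 and X2, Y only in file A, Z only in file B.
q, p_x1, p_x2, p_y, p_z = 2, 3, 3, 4, 4
p = p_x1 + p_x2 + p_y + p_z

# Population factor model: Sigma = Lambda Lambda^T + Psi, with Psi diagonal.
Lambda = rng.normal(size=(p, q))
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))
Sigma = Lambda @ Lambda.T + Psi

# Index sets for each group of variables.
x1 = np.arange(0, p_x1)
x2 = np.arange(p_x1, p_x1 + p_x2)
y = np.arange(p_x1 + p_x2, p_x1 + p_x2 + p_y)
z = np.arange(p_x1 + p_x2 + p_y, p)

# Blocks that are estimable from the two files:
# (X, Y) are jointly observed in file A, (X, Z) in file B.
# Off-diagonal blocks do not involve Psi, so each equals a product of loadings.
S_y_x2 = Sigma[np.ix_(y, x2)]    # Lambda_Y  Lambda_X2^T  (file A)
S_x1_x2 = Sigma[np.ix_(x1, x2)]  # Lambda_X1 Lambda_X2^T  (either file)
S_x1_z = Sigma[np.ix_(x1, z)]    # Lambda_X1 Lambda_Z^T   (file B)

# Low-rank completion of the unobserved Y-Z block. The explicit rcond
# discards numerically zero singular values of the rank-q middle block.
S_yz_hat = S_y_x2 @ np.linalg.pinv(S_x1_x2, rcond=1e-10) @ S_x1_z

# Compare with the true (never directly observed) block Lambda_Y Lambda_Z^T.
S_yz_true = Sigma[np.ix_(y, z)]
max_abs_error = np.max(np.abs(S_yz_hat - S_yz_true))
print(f"max abs error in completed block: {max_abs_error:.2e}")
```

The completion is exact here only because the sketch uses the population covariance and the loadings for X1 and X2 both have full column rank. With sample covariances the blocks are noisy, which is the setting where a likelihood-based fit such as the EM algorithm mentioned above becomes relevant.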
