SOLVE: A structured orthogonal latent variable framework for disentangling confounding in matrix data

Jialai She; Gil Alterovitz

PMC · DOI:10.1093/biomethods/bpaf094·January 28, 2026

SOLVE: A structured orthogonal latent variable framework for disentangling confounding in matrix data

Jialai She, Gil Alterovitz

PDF

Open Access

TL;DR

This paper introduces a new method to separate known and hidden factors in matrix data, improving the accuracy of gene-drug associations in pharmacogenomics.

Contribution

A novel structured orthogonal latent variable framework that disentangles confounding in matrix data with improved identifiability and interpretability.

Findings

01

The method recovers biologically coherent gene-drug associations missed by standard models, such as the EGFR-inhibitor link.

02

It identifies novel gene programs aligned with drug mechanisms, including a latent unfolded-protein-response module affecting drug sensitivity.

03

The framework provides valid inference on feature-outcome associations and improves biomarker discovery for precision oncology.

Abstract

Latent factor models are valuable in bioinformatics for accounting for unmeasured variation alongside observed covariates. Yet many methods struggle to separate known effects from latent structure and to handle losses beyond standard regression. We present a unified framework that augments row and column predictors with a low-rank latent component, jointly modeling measured effects and residual variation. To remove ambiguity in estimating observed and latent effects, we impose a carefully designed set of orthogonality constraints on the coefficient and latent factor matrices, relative to the spans of the predictor matrices. These constraints ensure identifiability, yield a decomposition in which the latent term captures only variation unexplained by the covariates, and improve interpretability. An efficient algorithm handles general non-quadratic losses via surrogates with monotone…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

EGFR

Proteins1

Species1

Homo sapiens(human · species)

Figures4

Click any figure to enlarge with its caption.

Illustration of the elbow method for selecting matrix rank in both regression (left) and logistic regression (right) models. Simulation uses n=100,m=100,p=5,q=4 and true rank r=3.

CCLE data. The top panel provides a global overview of the effect size and significance of all estimated gene-drug coefficients (A^) by SOLVE. The bottom panel shows bootstrap distributions for 10 top discoveries by SOLVE that were missed by CATE (CATE P-values > .98, SOLVE P-values < 5e-3), where row labels denote specific gene-drug pairs.

GCSI data. Comparison of SOLVE and standard logistic regression (without latent factors) in identifying significant gene-drug pairs (formatted as Gene Name—Drug Name). Top: Raincloud plots of the top significant gene-drug associations identified by SOLVE but missed by standard logistic regression. Bottom: Examples where logistic regression findings were corrected by SOLVE to lower significance.

Heatmap of the estimated latent factor matrix of unit rank. Row labels display the names of GCSI cancer cell lines, annotated with their corresponding tissue of origin (disease type). Disease abbreviations: “RCC” (Renal Clear Cell Carcinoma), “AdenoCa” (Adenocarcinoma), “Inv” (Invasive), and “Ca” (Carcinoma). Column labels display the therapeutic compounds corresponding to the drug-response endpoints (binarized from gCSI AUC measurements for classification). The color scale indicates the magnitude of the latent factor score.

Tables2

Table 1.. Simulation results comparing data-driven SOLVE, SVA, and CATE under regression (with varying noise levels) and logistic regression.

		Method
	Metric	SOLVE	SVA	CATE
$σ^{2} = 9$	Error on A	7.71 $\pm$ 0.16	24.12 $\pm$ 0.05	30.38 $\pm$ 0.52
	Error on B	8.88 $\pm$ 0.16	28.70 $\pm$ 0.24	28.70 $\pm$ 0.24
	Error on $η$	9.28 $\pm$ 0.09	16.82 $\pm$ 0.04	9.29 $\pm$ 0.09
	Rank recovery	100%	100.0%	100.0%
	Time	3.6	28	246
$σ^{2} = 25$	Error on A	12.91 $\pm$ 0.23	26.19 $\pm$ 0.11	32.21 $\pm$ 0.72
	Error on B	14.79 $\pm$ 0.27	31.18 $\pm$ 0.31	31.18 $\pm$ 0.31
	Error on $η$	15.85 $\pm$ 0.17	20.21 $\pm$ 0.13	15.87 $\pm$ 0.17
	Rank recovery	100%	92.0%	100.0%
	Time	3.6	31	282
$σ^{2} = 64$	Error on A	20.68 $\pm$ 0.32	30.80 $\pm$ 0.29	30.88 $\pm$ 0.61
	Error on B	23.66 $\pm$ 0.50	36.47 $\pm$ 0.62	36.47 $\pm$ 0.62
	Error on $η$	26.56 $\pm$ 0.58	28.17 $\pm$ 0.92	25.22 $\pm$ 0.24
	Rank recovery	68.0%	8.0%	0.0%
	Time	3.6	32	275
Bernoulli	Error on A	34.29 $\pm$ 0.30	—	—
	Error on B	35.32 $\pm$ 0.37	—	—
	Error on $η$	36.34 $\pm$ 0.35	—	—
	Rank recovery	86.0%	—	—

Table 2.. Average computational time (in seconds) for SOLVE, SVA, and CATE across varying data dimensions (n,m).

	Mean Runtime (seconds)
	SOLVE	SVA	CATE
$n = 200$ , $m = 150$	0.090	0.469	4.220
$n = 200$ , $m = 300$	0.184	1.316	11.477
$n = 400$ , $m = 150$	0.143	1.774	11.648
$n = 400$ , $m = 300$	0.572	3.703	25.455

Equations17

Keywords

latent factor modelscomputational biologyidentifiability constraintslow-rank matrix factorizationconfounding adjustmentmatrix outcomes

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStatistical Methods and Inference · Advanced Causal Inference Techniques · Bioinformatics and Genomic Networks

Full text

Introduction

In many scientific fields, a central goal is to infer the effect of a predictor of interest, $[eqn]$ , from a primary predictor matrix $[eqn]$ , on a particular outcome, $[eqn]$ , from an outcome matrix $[eqn]$ , often leveraging auxiliary column-specific data, Z. However, such analyses are frequently compromised by unobserved confounders—latent variables that distort estimation and inference and induce spurious associations [1, 2].

Our primary motivating example comes from Precision Oncology, a modern approach to cancer treatment that tailors therapies to the unique molecular profile of a patient’s tumor. Implementing this paradigm relies on preclinical discovery tools, most notably large-scale pharmacogenomic screens, such as the Cancer Cell Line Encyclopedia (CCLE) [3]. These screens systematically test drugs on hundreds of cancer cell lines—each representing a different tumor type—to discover predictive biomarkers: genomic features that predict a drug’s effectiveness.

The data from these studies yield three key matrices. First, an outcome matrix Y, where each entry $[eqn]$ measures the sensitivity (or effectiveness) of the k-th drug on the i-th cancer cell line. This outcome can be a continuous measure, such as the half-maximal inhibitory concentration (IC50) or the area under the dose-response curve (AUC)—for both metrics, lower values indicate higher sensitivity. The outcome can also be converted to a binary indicator (sensitive vs. resistant) for classification purposes [4]. Second, a matrix of row-specific primary predictors X, which contains the rich genomic features for each cell line, such as gene expression levels or the status of specific mutations. Finally, a matrix of column-specific auxiliary predictors Z encodes key drug properties, such as chemical structure or molecular targets.

However, a model based solely on observed features is insufficient because drug response is governed by a multitude of unobserved variables reflecting latent properties of both cell lines and drugs. For instance, a cell line’s response is often determined by the activation state of critical signaling pathways (e.g., MAPK, PI3K/AKT) or the influence of the tumor microenvironment—factors rarely measured directly in high-throughput screens [5, 6]. The presence of these influential but unobserved variables creates systematic confounding that can mask true drug-gene associations or create false ones.

Related work

Existing methods for handling multivariate confounding often fall short when the primary goal is accurate inference on specific predictor-outcome relationships (X to Y). Standard regression models simply ignore latent variables, while popular adjustment strategies typically follow a multi-step, ad hoc process—often employing principal component analysis (PCA) to extract latent factors from outcome data after accounting for known predictors [7]. For example, SVA [8] addresses hidden sources of variation by regressing out known predictors, decomposing the residuals, selecting informative gene subsets, and constructing surrogate variables for downstream analysis. PEER [9] uses Bayesian factor analysis with automatic relevance determination to infer the number of latent factors directly from the data. CATE [10] adjusts for confounding either through user-specified negative controls or, under a sparsity assumption, by robust regression on the bulk of the data.

However, these prevailing approaches have key drawbacks.

Tailored for continuous outcomes under squared-error loss, these procedures and implementations do not naturally extend to other data types, such as binary outcomes in classification, nor do they accommodate both row- and column-wise predictors.Neither SVA nor PEER strictly separates estimated latent factors from observed predictors, which can result in unidentifiable parameters and complicate interpretation.All these methods share a weakness rooted in their sequential design. By estimating the latent structure in a separate, subsequent step (rather than through joint optimization) they often fail to achieve global optimality. Indeed, as recognized in related fields such as panel data econometrics, these sequential procedures can lead to biased estimates of the coefficients of interest [11].

A unified SOLVE framework

The latent modeling of matrix outcomes presents significant challenges, but a key insight that makes this problem tractable is the assumption that the residual variation in Y (not explained by the observed predictors) has a low-dimensional structure, a premise rooted in the biological reality that cellular response is often orchestrated by a small number of master regulators [12]. This principle of hierarchical control implies that the collective effect of thousands of unmeasured molecular events can be parsimoniously captured by a small set of latent factors. For example, a cell line may have a latent dependency on a specific survival pathway, such as Wnt signaling, a phenomenon known as non-oncogene addiction [13], while a drug may exert potent but uncharacterized off-target effects against that same pathway [14]. The resulting strong sensitivity, observed only when this specific cellular vulnerability meets the drug’s latent mechanism, creates a coherent pattern of response across the dataset. The modularity of disease networks implies that just a handful of core pathways become dominant drivers of cell survival through dysregulation [15]. Aggregating such dominant vulnerability-mechanism pairings naturally forms a low-rank structure.

The proposed framework, Structured Orthogonal Latent Variable Estimation (SOLVE), overcomes aforementioned challenges by jointly and explicitly modeling the contributions of unobserved latent factors alongside the observed row- and column-specific predictors within a single, unified model.

First, a novel latent factor framework is introduced that integrates row predictors X, column predictors Z, and a low-rank latent component within a unified model. This formulation extends latent factor modeling beyond standard regression to accommodate a broad class of generalized linear models.To resolve model ambiguity, we impose carefully designed orthogonality constraints on the coefficient and latent factor matrices relative to the spans of the predictor matrices. These constraints ensure parameter identifiability and yield a unique decomposition where the latent structure captures only variation unexplained by known covariates, improving interpretabilityComputationally, an efficient optimization algorithm is proposed for general non-quadratic losses by constructing quadratic surrogates with guaranteed monotone descent and convergence. The method features closed-form updates: the latent factor is updated via truncated singular-value decomposition (SVD) of a doubly projected residual, while coefficient blocks are refined through simple projections, ensuring scalability to large datasets.Furthermore, for model selection and inference, the framework uses a data-driven elbow method on a penalized information criterion to select the optimal rank, and introduces a parametric bootstrap to compute valid P-values for the gene-drug coefficients—a task complicated by the regularized low-rank structure.Finally, applied to CCLE and GCSI, SOLVE recovers biologically coherent gene-drug associations that standard latent-factor models miss (e.g., EGFR expression linked to EGFR inhibitors) and identifies additional candidates whose activation signatures align with compound-specific modes of action. Interpreting the learned latent structure reveals an unfolded protein response (UPR) module driving drug sensitivity in GCSI—a known resistance pathway. Overall, SOLVE improves sensitivity and specificity and exposes latent biology beyond what measured covariates can explain.

Materials and methods

Datasets

Validation using simulated datasets

Synthetic datasets were constructed to evaluate a latent factor method’s performance under controlled conditions that mimic the structure of real pharmacogenomic data.

The data generation procedure is as follows. The row-specific predictors $[eqn]$ and column-specific predictors $[eqn]$ were drawn independently from a multivariate normal distribution $[eqn]$ , and the columns of Z were sampled from $[eqn]$ . Both $[eqn]$ and $[eqn]$ specified as Toeplitz matrices with entries $[eqn]$ , where we set the correlation parameter $[eqn]$ . The true underlying kernel coefficient matrices $[eqn]$ and $[eqn]$ (cf. (6)) were created with entries drawn from a standard normal distribution. The dimensions were set to be $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ . The latent structure kernel $[eqn]$ was formed as the product of two standard normal matrices to have a true rank of $[eqn]$ . The final coefficient matrices $[eqn]$ (associated with $[eqn]$ , respectively) and latent factor matrix $[eqn]$ were obtained using (6), and correspondingly, the systematic component $[eqn]$ (cf. (2)).

The outcome matrix Y was produced under two scenarios: (i) Regression. The response was given by $[eqn]$ , where the entries of the noise matrix E were drawn i.i.d. from $[eqn]$ . We tested three noise levels: $[eqn]$ . (ii) Classification. The response matrix Y was binary with each element $[eqn]$ following a Bernoulli law with parameter $[eqn]$ . For each setting, 50 independent datasets were generated in order to better average out noise and provide a robust evaluation.

Pharmacogenomic data: CCLE

The Cancer Cell Line Encyclopedia (CCLE) served as the first pharmacogenomic dataset in our study, a public resource cataloging both genomic profiles and drug responses for 148 human cancer cell lines [16]. Drug sensitivity was assessed on the basis of the area under the dose-response curve (AUC) for 24 compounds, with lower values indicating greater sensitivity. Baseline gene-expression profiles were obtained for all cell lines. To identify genes associated with drug response, we regressed the 24-drug response profile on individual gene expression, assessed significance using Wilks’ Lambda, and applied Bonferroni correction for multiple testing. This yielded 70 genes strongly associated with drug response, based on an adjusted P-value threshold of .05. For each drug, chemical and physicochemical descriptors were obtained through Dragon 7 software; a similar screening procedure yielded six key descriptors. Missing entries in the drug response matrix were imputed using the missForest algorithm, which leverages both gene expression and molecular descriptors as predictors [17].

Pharmacogenomic data: GCSI

The second pharmacogenomic dataset used in this study was from the Genentech Cell Line Screening Initiative (GCSI) [18]. A processed version available through the same MoDaC portal comprises 130 cancer cell lines, each tested against 16 drugs. Following a feature screening process similar to that for the CCLE data, 83 genes were retained for analysis; no significant molecular descriptors for the compounds were identified, resulting in a null drug-feature matrix. The task was formulated as a classification problem, for which drug responses measured as AUC were binarized into effective (AUC $[eqn]$ 0.7) or ineffective (AUC > 0.7); the threshold was selected based on [4] as a meaningful cutoff to identify cell lines responsive to treatment.

Model, regularization, and identifiability

Let $[eqn]$ be the response matrix of interest. For the $[eqn]$ th outcome of $[eqn]$ , a standard modeling approach in our setting is to use a linear model that incorporates row-specific predictors (e.g., sample features) and column-specific predictors (e.g., item features). Let $[eqn]$ be the predictor vector for the i-th row, and $[eqn]$ be the predictor vector for the k-th column with $[eqn]$ .

To simplify the discussion, we begin with a generalized linear model (GLM) [19]:

[eqn]

where g is a link function, and $[eqn]$ and $[eqn]$ are the corresponding coefficient vectors. For binary outcomes $[eqn]$ , the logistic link $[eqn]$ is commonly used, while for a regression model g is the identity function $[eqn]$ . In matrix form, this becomes the baseline model

[eqn]

where $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ .

A critical limitation of the baseline model is its failure to account for latent variables—unobserved factors that systematically influence the response. For example, in pharmacogenomics, a cell line may have a latent dependency on a specific survival pathway, while a drug may possess an uncharacterized potency against that same pathway. Similarly, in microbial genomics, it is necessary to account for factors like a species’ intrinsic metabolic fitness, as some species are inherently faster growers, and the general nutrient limitation of an environment. These unobserved effects create systematic patterns in the data that, if ignored, will distort estimation and inference for the model coefficients.

To address this, we introduce an adjustment matrix $[eqn]$ :

[eqn]

where the right-hand side represents the systematic component, often denoted by $[eqn]$ [19]. Matrix C is intended to capture the effects of latent factors—real but unmeasured entities. A latent factor might represent the activity of a specific cancer-related signaling pathway or correspond to an individual’s ancestry from a specific subpopulation.

To quantify the effects of these underlying causal mechanisms not captured by the primary predictors, we must impose a structural assumption on C. In the previous microbial genomics example, one could model the latent effect using an additive structure: $[eqn]$ with $[eqn]$ representing a species’ intrinsic fitness and $[eqn]$ the growth limitation of an environment, which creates a rank-2 matrix. Similarly, in the pharmacogenomics example, the unobserved effect could be modeled multiplicatively as $[eqn]$ , where $[eqn]$ is a cell line’s latent dependency on the pathway and $[eqn]$ is a drug’s latent potency against it. Generalizing from this, we assume this latent structure matrix C has low-rank at most r and $[eqn]$ This assumption formalizes the principle that after accounting for the primary predictors X and Z, only a few dominant confounding factors remain.

However, (2) introduces a new problem: model ambiguity. For instance, one can always write $[eqn]$ for any matrix G, where $[eqn]$ denotes the Moore–Penrose pseudoinverse of M and $[eqn]$ gives the orthogonal projector onto the column space of M. To resolve the issue, we enforce carefully chosen orthogonality conditions to disentangle observed and unobserved effects:

[eqn]

where $[eqn]$ and $[eqn]$ represent the column space (range) and row space of M, respectively. In particular, (ii) and (iii) ensures that C captures only the variation that cannot be explained by X or Z.

Imposing these constraints leads us to the model below, which we term Structured Orthogonal Latent Variable Estimation (SOLVE):

[eqn]

where $[eqn]$ are given, $[eqn]$ are unknown and C is of low rank. The term XA represents the component of Y that is best explained by the primary predictors X. The second term captures the variation explained by the auxiliary column predictors Z within the space orthogonal to X. Finally, the third term models the low-rank latent structure in the residual variation orthogonal to both X and Z. Not only does this orthogonal structure ensure that the unique contributions of the observed predictors and latent factors are identifiable for proper inference, but it also provides computational tractability (cf. Theorem 1).

For regression, where g is the identity function, (3) can be expressed as

[eqn]

For binary outcomes $[eqn]$ , the logistic link gives

[eqn]

subject to the same orthogonality constraints. This logistic model is relevant for classification problems common in bioinformatics.

Estimation, tuning, and inference

To simplify the problem, we introduce a reparametrization motivated by linear space constraints:

[eqn]

which yields the optimization problem:

[eqn]

Here, $[eqn]$ or $[eqn]$ for short, is a general loss function to quantify the error between the model output $[eqn]$ and the observed outcome Y (given $[eqn]$ ).

In this section, we derive an optimization-based iterative algorithm that is both easy to implement and guaranteed to converge, without restricting l to a negative log-likelihood from a GLM.

Closed-form solution for regression

For the standard regression case with a squared Frobenius norm loss $[eqn]$ , it is tempting to solve the problem via an alternating procedure: repeatedly performing multivariate regression to update A and B and applying PCA to estimate C until convergence. Perhaps surprisingly, however, SOLVE admits a joint globally optimal closed-form solution for all parameters.

Let the singular value decomposition (SVD) of $[eqn]$ be $[eqn]$ where $[eqn]$ are orthogonal matrices: $[eqn]$ and D is a diagonal matrix of the singular values in non-increasing order.

For a given rank parameter $[eqn]$ , define a truncated SVD operator $[eqn]$ as

[eqn]

where $[eqn]$ are the matrices formed by the first r columns of $[eqn]$ , respectively, and $[eqn]$ is the diagonal matrix obtained from D by retaining only its largest r singular values.

Theorem 1. The rank-restricted least squares problem

[eqn]

admits a globally optimal solution $[eqn]$ given explicitly by

[eqn]

The proof is presented in the Appendix. Based on (6), the solutions $[eqn]$ can be obtained.

The existence of a non-iterative, closed-form solution to the nonconvex optimization problem (9) is a contribution with practical implications. Many advanced statistical models, particularly those involving latent variables, depend on iterative optimization algorithms such as Expectation-Maximization (EM) or Alternating Least Squares (ALS) or sampling-based methods like Markov Chain Monte Carlo (MCMC). These iterative procedures can be computationally intensive, highly sensitive to initialization parameters, and may converge to local optima rather than the global optimum. In contrast, a closed-form solution with theoretical guarantees is computationally more efficient and scalable.

Iterative optimization via surrogates

One may encounter a non-quadratic loss in (7)—a common example is the logistic loss for binary data under the logistic link (5):

[eqn]

where the so-called “cumulant” $[eqn]$ and $[eqn]$ , the logistic link in (5). In such GLMs, a closed-form solution rarely exists. Fortunately, we can develop an optimization algorithm based on a surrogate function and leverage a closed-form regression solution (cf. Theorem 1) to update the estimate.

We assume the loss gradient enjoys Lipschitz continuity: for any $[eqn]$ ,

[eqn]

for some $[eqn]$ , where $[eqn]$ denotes the Frobenius norm. For instance, $[eqn]$ for the logistic loss since $[eqn]$ , and $[eqn]$ for the standard squared error loss $[eqn]$ .

Define a surrogate function for $[eqn]$ at $[eqn]$ as

[eqn]

Starting with an arbitrary initial point $[eqn]$ with $[eqn]$ , we define an iterative algorithm where the $[eqn]$ -th iterate ( $[eqn]$ ) is the solution to

[eqn]

The following theorem states that when $[eqn]$ is chosen properly large (e.g., $[eqn]$ in classification), the algorithm converges, with its rigorous proof presented in the Appendix.

Theorem 2. Suppose the loss function l has an L-Lipschitz continuous gradient as in (12). Starting with any feasible initial point $[eqn]$ , let the sequence of iterates $[eqn]$ be defined by (14) with an inverse step size $[eqn]$ . Then for any $[eqn]$ , we have $[eqn]$ and

[eqn]

Consequently, the sequence of objective values $[eqn]$ is non-increasing and thus convergent, and under $[eqn]$ , the difference between consecutive iterates converges to zero, i.e., $[eqn]$ as $[eqn]$ .Now, let’s simplify (14).

[eqn]

where $[eqn]$ does not depend on $[eqn]$ . The surrogate function g becomes simply quadratic in $[eqn]$ . Introducing the pseudo-response matrix $[eqn]$ , held fixed in every iteration, reduces the per-iteration optimization to solving a constrained regression.

Define

[eqn]

From Theorem 1, the iterate updates are given by

[eqn]

for all $[eqn]$ .

An outline of our SOLVE algorithm is as follows.Algorithm 1The SOLVE algorithmRequire: $[eqn]$ , r, $[eqn]$ , and $[eqn]$ with $[eqn]$ (e.g., all zero matrices).1: $[eqn]$ , $[eqn]$ 2: while not converged do 3: $[eqn]$ 4: $[eqn]$ 5: $[eqn]$ 6: $[eqn]$ 7: $[eqn]$ 8: $[eqn]$ 9: end while 10: $[eqn]$ , $[eqn]$ , $[eqn]$ 11: return $[eqn]$ , $[eqn]$ .For the GLM loss in (11), $[eqn]$ , where $[eqn]$ is applied componentwise. For standard regression, where $[eqn]$ and $[eqn]$ , setting $[eqn]$ reduces the pseudo-response to $[eqn]$ , allowing for convergence in a single step. For logistic regression, where $[eqn]$ and $[eqn]$ , any $[eqn]$ ensures convergence. In general, the algorithm converges rapidly in practice.

Rank selection

The sole tuning parameter in the SOLVE model is the rank r of the latent structure matrix C, which determines the model’s complexity. To select the optimal rank from the data, we propose using an information criterion that balances model fit with model complexity. Specifically, for a given rank r, the Akaike Information Criterion (AIC) [20] is defined as $[eqn]$ where $[eqn]$ is the log-likelihood of the fitted model and $[eqn]$ is the effective degrees of freedom. For the logistic regression case, the loss corresponds to the model deviance: $[eqn]$ (in implementation, we use the numerically stable “softplus” function, defined as $[eqn]$ , as a substitute for $[eqn]$ to improve computational stability). For the regression case under the assumption of Gaussian errors, $[eqn]$ (up to additive constants). The degrees of freedom for the SOLVE model are given by the total number of free parameters:

[eqn]

Here, pm and nq are the number of parameters in A and B, respectively (which are constant when varying r), and $[eqn]$ is the number of free parameters in the rank-r matrix C.

The practical procedure for rank selection is to fit the SOLVE model over a pre-specified grid $[eqn]$ of candidate ranks, e.g., $[eqn]$ . Standard practice selects the rank that minimizes the information criterion, $[eqn]$ , but we find this approach can overfit by selecting a rank that is too high. Our recommendation is to use the “elbow” point: plotting the information criterion $[eqn]$ against the rank r and identifying the rank at which the rate of decrease in the IC value sharply flattens (cf. Fig. 1). The underlying principle is that the singular values associated with the true signal are substantially larger than those generated by noise, creating a clear gap or elbow that marks the transition from signal to noise and reveals the effective rank. Similar ideas have also been employed in contexts such as determining the number of clusters or principal components [21]. In implementation, for an objective and reproducible selection, we first normalize both axes and then compute the perpendicular distance from each point on the curve to the straight line connecting the endpoints. The elbow point is then defined as the point with the maximum perpendicular distance from this line, corresponding to the position of greatest curvature. In our experience, this data-driven tuning robustly balances model fit and complexity, capturing low-dimensional latent structure without overfitting.

Illustration of the elbow method for selecting matrix rank in both regression (left) and logistic regression (right) models. Simulation uses n=100,m=100,p=5,q=4 and true rank r=3.

Inference via bootstrap

Standard methods for statistical inference, which rely on deriving the asymptotic distribution of estimators, are not readily applicable to the SOLVE method. The presence of the low-rank constraint on the latent matrix C (which is discrete and non-differentiable) introduces challenges related to the incidental parameters problem, where the number of parameters grows with the sample size [11]. To address this, we use parametric bootstrap and wild bootstrap [22, 23] to construct valid confidence intervals.

The procedure begins with fitting the SOLVE model to the original data $[eqn]$ using the rank selected via the information criterion as discussed earlier, yielding the estimats $[eqn]$ and $[eqn]$ . Then a bootstrap response matrix $[eqn]$ is constructed and the model is refit. This is repeated for a large number of bootstrap samples, $[eqn]$ . For example, for the classification case with binary outcomes, we convert the fitted logits $[eqn]$ to probabilities via $[eqn]$ : $[eqn]$ , and then generate each element of the bootstrap data matrix by drawing from the corresponding Bernoulli distribution, $[eqn]$ . For regression, we use a wild bootstrap to robustly handle potential heteroskedasticity: generate residuals $[eqn]$ and apply independent Rademacher weights $[eqn]$ . The bootstrap sample is $[eqn]$ , where $[eqn]$ denotes element-wise multiplication.

After generating the bootstrap data $[eqn]$ for either case, the SOLVE algorithm is refit on $[eqn]$ to obtain bootstrap parameter estimates $[eqn]$ . The resulting collection of estimates is then used to construct confidence intervals, enabling inference on the primary parameters via resampling.

Results and discussion

Performance comparison in simulations: accuracy and rank recovery

On each generated dataset, we applied the SOLVE algorithm (cf. Algorithm 1) and selected the optimal rank by the elbow method from a candidate grid of $[eqn]$ . For comparison, we also evaluated SVA [8] and CATE [10] using their standard R package implementations. Since these methods do not include column-specific predictors, we first regressed Y on Z and then applied the algorithms to the resulting residual matrix. Both SVA and CATE were configured to automatically select the number of latent factors.

To measure estimation accuracy, we computed the relative Frobenius norm error between the true parameters and their estimates, e.g., $[eqn]$ . The performance in identifying the correct number of latent factors was measured by the percentage of experiments where the estimated rank $[eqn]$ exactly matches the true rank $[eqn]$ . The total computational time was calculated by summing the execution time of the model-fitting and tuning of all 50 replicates. A summary of the experimental results is presented in Table 1.

In the regression setting, both SVA and CATE provide reasonable estimates for the overall signal $[eqn]$ , particularly at lower noise levels. However, they struggle to accurately disentangle and estimate the primary coefficient matrices A and B. For example, at the noise level $[eqn]$ , the estimation errors for A are high for both SVA (24.12) and CATE (30.38). In contrast, SOLVE offers a substantial improvement, achieving an error of only 7.71 for A—a more than three-fold improvement. This advantage in recovering the primary coefficients persists as the noise level increases. At $[eqn]$ , the ability of SVA and CATE to recover the correct rank collapses. SOLVE, however, proves far more robust, maintaining a decent rank recovery rate and providing significantly more accurate estimates for A and B. In the more difficult classification setting, to which SVA and CATE are not applicable, SOLVE continues to perform well. The estimation errors are higher (as anticipated due to the information loss from binary outcomes) but remain controlled. The rank selection procedure remains effective, correctly identifying the true rank in 86% of simulations.

SOLVE also delivers a substantial computational advantage, as shown in Table 1. To examine scalability more directly, Table 2 reports average runtimes for model fitting and tuning under varying $[eqn]$ while holding other settings fixed (e.g., $[eqn]$ , $[eqn]$ , $[eqn]$ ). The results are striking: SOLVE runs about 5 $[eqn]$ faster than SVA and nearly 50 $[eqn]$ faster than CATE across all tested dimensions. This persistent gap highlights SOLVE’s strong scalability and its practical value for larger problems.

Overall, the simulation results demonstrate that the proposed SOLVE algorithm more efficiently and accurately recovers the true coefficient matrices and latent structure from confounded matrix data, even in high-noise and non-Gaussian settings.

Uncovering biologically coherent gene-drug associations

Using the CCLE data, we fit SOLVE to standardized predictor matrices (intercepts included). We identified 12 latent factors via rank tuning and performed inference using 1000 bootstrap samples (cf. Materials and Method). The volcano plot in Fig. 2 summarizes the resulting effect sizes and P-values. A comparative analysis against the CATE package [10] revealed numerous significant gene-drug interactions that were uniquely discovered by SOLVE, some illustrated in the figure.

CCLE data. The top panel provides a global overview of the effect size and significance of all estimated gene-drug coefficients (A^) by SOLVE. The bottom panel shows bootstrap distributions for 10 top discoveries by SOLVE that were missed by CATE (CATE P-values > .98, SOLVE P-values < 5e-3), where row labels denote specific gene-drug pairs.

We demonstrate two primary strengths of the SOLVE model. The first is its ability to produce biologically coherent results by explicitly modeling the shared properties (Z) of related drugs, unlike methods that treat each compound independently. This advantage is evident in its analysis of two therapeutic compounds, the MEK inhibitors, PD-0325901 and AZD6244. Leveraging their strongly correlated molecular descriptors in the drug-feature matrix (correlation $[eqn]$ ), SOLVE translates this structural similarity into nearly identical gene-interaction profiles (Pearson $[eqn]$ ). In contrast, the CATE-derived gene profiles for these two drugs were only moderately correlated ( $[eqn]$ ). SOLVE uses shared molecular features of drugs to predict gene associations that are more pharmacologically accurate.

Another key advantage of the SOLVE model is its increased statistical power, driven in part by effective characterization of latent factors (C), which enables detection of biologically meaningful gene-drug associations often missed by CATE. For example, SOLVE identified a highly significant association between GRB7 expression and sensitivity to the drug Erlotinib (P = 4e-4 for SOLVE vs. .98 for CATE). Erlotinib is a well-established EGFR inhibitor, and GRB7 encodes an adaptor protein mediating signaling via EGFR and HER2 pathways; its overexpression is a marker of pathway hyperactivation linked to cancer progression and resistance [24]. Similarly, SOLVE revealed a strong association between SMARCD2 expression and Irinotecan sensitivity (P = 1e-4 for SOLVE vs. .99 for CATE). Irinotecan is a topoisomerase I inhibitor that induces DNA damage, and SMARCD2 encodes a core SWI/SNF complex subunit essential for the DNA damage response [25]. Notably, loss of SWI/SNF function has been shown to sensitize cancer cells to topoisomerase inhibitors [26].

Interestingly, beyond recovering established drug-target interactions, SOLVE uncovers new, biologically informative associations. For Erlotinib, SOLVE identified both the canonical target EGFR and novel candidates genes such as PRR15L as highly significant predictors of drug sensitivity (both P = 1e-4), whereas CATE failed to detect the latter (P = .30). High PRR15L expression may signal a cell’s functional reliance on EGFR-driven survival signaling [27, 28]. This highlights an important insight: biomarkers reflecting downstream pathway dependence can complement target expression in predicting therapeutic response. Overall, compared with conventional approaches, SOLVE’s ability to borrow information across mechanistically related drugs and effectively reduce structured noise by modeling latent factors leads to greater sensitivity for discovering such nuanced gene-drug relationships.

Improved power, specificity, and latent pathway discovery in GCSI

For the classification task on the GCSI dataset, we compared SOLVE with logistic regression, using 1000 bootstrap samples to assess the significance of gene-drug associations. (Other prominent latent factor methods like SVA and CATE were excluded, as they are designed for regression problems.) Figure 3 shows a demonstration.

GCSI data. Comparison of SOLVE and standard logistic regression (without latent factors) in identifying significant gene-drug pairs (formatted as Gene Name—Drug Name). Top: Raincloud plots of the top significant gene-drug associations identified by SOLVE but missed by standard logistic regression. Bottom: Examples where logistic regression findings were corrected by SOLVE to lower significance.

The top panel of Fig. 3 highlights SOLVE’s increased statistical power to detect biologically meaningful associations missed by standard methods. A prime example is the link between SFN (Stratifin, also known as 14-3-3 $[eqn]$ ) expression and the drug Paclitaxel, which SOLVE identified as highly significant (P = 1e-3), whereas standard logistic regression failed to distinguish this signal from noise (P > .14).

Paclitaxel induces mitotic arrest through microtubule stabilization [29]. SFN is a p53-regulated checkpoint factor that prevents mitotic entry by sequestering the CDC2-Cyclin B1 complex in the cytoplasm under stress [30]. Because elevated SFN enforces G2 arrest and has been associated with therapeutic resistance in human cancers [31], its expression naturally aligns with reduced sensitivity to anti-mitotic agents. The ability of SOLVE to recover this association suggests that it accounts for latent cell-cycle states that obscure such biologically grounded effects in standard models.

Conversely, SOLVE demonstrated greater specificity by filtering out potentially spurious associations that simpler models may flag as significant. For example, while standard logistic regression identified a significant association between NRAS expression and sensitivity to Doxorubicin (P < .005), SOLVE found this link to be insignificant (P > .1). NRAS activation is a canonical driver of uncontrolled cellular proliferation via the MAPK signaling cascade [32]. Since cytotoxic agents preferentially kill rapidly dividing cells, high proliferation rates can generate broad, non-specific drug sensitivity in large-scale screens, a systematic confounding effect driven by differences in baseline division rate [33]. This correlation reflects a generalized cellular state rather than a specific, mechanistically grounded biomarker of Topoisomerase II inhibition. The association detected by logistic regression therefore likely reflects general pathway activity rather than a direct predictive dependency, whereas SOLVE’s regularized latent-factor framework helps distinguish robust drug-biomarker relationships from confounded ones.

Finally, we explored the biological pathway underlying the tuned, rank-1 latent structure found by SOLVE, whose low-rank patterns are visualized in the heatmap in Fig. 4. By design, the latent factor matrix $[eqn]$ is orthogonal to the model predictors, representing systematic variation in drug response unexplained by the genes retained in X. In this experiment, we purposefully excluded explicit intercepts, as the low-rank latent term is capable of absorbing these baseline effects. Inspection shows that no row or column of $[eqn]$ is constant; hence, the discovered latent factor does not act merely as an intercept, but rather captures meaningful co-variation patterns between groups of cell lines and drugs. The estimated latent matrix was tuned to rank-1, meaning all drug responses are modulated along the same latent direction. Each drug therefore differs only by a single scalar coefficient, which determines the sign and magnitude of its modulation. Consequently, drugs with similar coefficients yield nearly identical column patterns, while drugs with coefficients of opposite sign produce column patterns that are exact opposites.

Heatmap of the estimated latent factor matrix of unit rank. Row labels display the names of GCSI cancer cell lines, annotated with their corresponding tissue of origin (disease type). Disease abbreviations: “RCC” (Renal Clear Cell Carcinoma), “AdenoCa” (Adenocarcinoma), “Inv” (Invasive), and “Ca” (Carcinoma). Column labels display the therapeutic compounds corresponding to the drug-response endpoints (binarized from gCSI AUC measurements for classification). The color scale indicates the magnitude of the latent factor score.

We performed SVD on the estimate $[eqn]$ , the left singular vector u yielding loadings for each cell line. The 10 cell lines with the highest and lowest loadings were selected, and a differential expression analysis was conducted using cBioPortal [34]; the top 100 genes, ranked by P-value, were subsequently analyzed for pathway enrichment using Enrichr [35]. This revealed significant enrichment for the Unfolded Protein Response (UPR) pathway across multiple databases, including Reactome Pathways 2024, BioPlanet 2019, and MSigDB Hallmark 2020. This finding is highly relevant, as chronic UPR activation is a known mechanism of drug resistance in cancer [36]. By uncovering this hidden biological state, SOLVE demonstrates its ability to produce more mechanistically interpretable models.

Conclusion

Methodological contributions

This work introduces SOLVE, a unified latent-factor framework that jointly models row- and column-specific predictors with a low-rank component for unmeasured variation. In pharmacogenomic analyses, it simultaneously leverages baseline gene expression profiles and drug descriptors while correcting for unobserved confounders—tasks often handled separately by first regressing out measured covariates, then applying PCA to the residuals to capture latent structure.

By enforcing orthogonality between the spaces spanned by observed predictors and the latent structure, the framework ensures identifiability, separating measured effects from residual variation so that latent factors capture only unexplained structure, thus enhancing interpretability.

Computationally, the algorithm alternates truncated SVD updates for the latent term with projection-based coefficient refinements, guided by a quadratic surrogate to efficiently handle general, non-quadratic loss functions. Rank selection uses an elbow method with a degrees-of-freedom penalized criterion to balance fit and complexity, and the framework also supports inference on gene-drug associations via a parametric bootstrap.

As demonstrated in simulations, this integrated approach yields more accurate parameter estimates and more robust rank recovery than competing latent factor models, particularly in high-noise and non-Gaussian settings.

Pharmacogenomic insights

Applications to large-scale pharmacogenomic datasets (CCLE and GCSI) showed that SOLVE recovers established mechanisms and reveals new, mechanistically plausible associations. SOLVE produced more biologically coherent results by leveraging shared drug descriptors, as demonstrated by its ability to assign nearly identical gene-interaction profiles to two MEK inhibitors with highly correlated molecular features.

Moreover, SOLVE showed increased statistical power, not only recovering established gene-drug associations missed by other latent factor methods—like the links between GRB7 and Erlotinib sensitivity or SMARCD2 and Irinotecan sensitivity—but also uncovering new insights, such as identifying PRR15L as a potential biomarker for downstream pathway dependence on EGFR signaling.

The proposed model also demonstrated greater specificity by correcting spurious associations, such as the finding that NRAS expression was not a significant predictor for Binimetinib sensitivity, in contrast to conventional logistic regression.

Complementing this improved specificity, the model discovered a latent factor in the GCSI data that was significantly enriched for the Unfolded Protein Response (UPR) pathway, a known mechanism of drug resistance.

These findings underscore the value of SOLVE in precision oncology by disentangling measured predictor effects from latent confounding, yielding more robust biomarkers for patient stratification and deeper insight into drug resistance mechanisms.

Future directions

While motivated by challenges in pharmacogenomics, the SOLVE framework addresses the general problem of matrix outcomes by jointly modeling the effects of both structured predictors (for rows and columns) and unobserved confounders. It is therefore broadly applicable to other settings with this data structure, such as microbial genomics or multi-omics integration.

Promising extensions to the framework include incorporating structured sparsity in the coefficient matrices, for instance, by enforcing row-wise sparsity in matrix A for gene selection or column-wise sparsity in matrix B for molecular descriptor selection, while preserving the low-rank latent structure. This enhancement would directly address high-dimensional applications by relaxing the assumption that $[eqn]$ and $[eqn]$ , enabling simultaneous variable selection and latent structure estimation.

Furthermore, to address the limitation that not all unmeasured confounding is low-rank, another extension could model the residual term as a composite $[eqn]$ , analogous to Robust PCA [37]. Here, C would continue to capture the global, modular confounders, while a sparse matrix S would account for local, high-impact but rare events (e.g., a cell-type-specific mutation).

Finally, a further direction is to integrate multi-omics predictors (e.g., mutations, copy number, proteomics) to better disentangle pathway-specific from global stress signals, while leveraging prior biological networks to guide latent factor estimation.

Key Points

Unified Modeling Framework: SOLVE introduces a unified framework that jointly models row and column predictors with a low-rank latent component in a single, flexible model that handles diverse data types beyond standard regression.
Guaranteed Identifiability and Interpretability: The framework resolves a critical ambiguity in latent variable modeling by imposing carefully designed orthogonality constraints, forcing the latent component to capture only biological variation unexplained by measured covariates.
Computationally Efficient and Scalable Algorithm: A computationally efficient algorithm with theoretical guarantees of convergence features closed-form updates, where the latent component is derived via a truncated SVD of a doubly projected residual, yielding a globally optimal solution for regression cases.
Superior Power for Biomarker Discovery: The method demonstrates superior sensitivity and specificity in pharmacogenomic screens, recovering critical gene-drug associations missed by standard latent factor approaches (such as the link between GRB7/EGFR and EGFR inhibitors) and identifying novel therapeutic candidates.
Discovery of Latent Biological Mechanisms: The orthogonal latent structure reveals hidden biological drivers of sample response that are not captured by measured features, such as identifying a latent Unfolded Protein Response (UPR) signature as a key drug resistance mechanism in cancer cell lines.

Bibliography39

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Leek JT , Scharpf RB, Bravo HC et al Tackling the wider challenges of high-throughput biology. Nat Rev Genet 2010;11:733–9.20838408 10.1038/nrg 2825 PMC 3880143 · doi ↗ · pubmed ↗
2Gagnon-Bartsch J , Jacob L Speed TP. Removing unwanted variation from high-dimensional data with negative controls. Berkeley: Tech Reports from Dep Stat Univ California, 2013, 1–112.
3Barretina J , Caponigro G, Stransky N et al The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 2012;483:603–7.22460905 10.1038/nature 11003 PMC 3320027 · doi ↗ · pubmed ↗
4Ohnmacht A-J , Rajamani A, Avar G et al The pharmacoepigenomic landscape of cancer cell lines reveals the epigenetic component of drug sensitivity. Commun Biol 2023;6:825.37558831 10.1038/s 42003-023-05198-y PMC 10412573 · doi ↗ · pubmed ↗
5Hanahan D , Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144:646–74.21376230 10.1016/j.cell.2011.02.013 · doi ↗ · pubmed ↗
6Junttila MR , de Sauvage FJ. Influence of tumour micro-environment heterogeneity on therapeutic response. Nature 2013;501:346–54.24048067 10.1038/nature 12626 · doi ↗ · pubmed ↗
7Price AL , Patterson NJ, Plenge RM et al Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006;38:904–9.16862161 10.1038/ng 1847 · doi ↗ · pubmed ↗
8Leek JT , Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. P Lo S Genet 2007;3:e 161–1735.17907809 10.1371/journal.pgen.0030161 PMC 1994707 · doi ↗ · pubmed ↗