On the role of the overall effect in exponential families
Anna Klimova, Tamás Rudas

TL;DR
This paper examines how adding or removing the overall effect in exponential families affects their properties, geometry, and computational aspects, with implications for statistical modeling and biological data analysis.
Contribution
It characterizes the impact of the overall effect on exponential family properties, geometry, and algorithms, linking algebraic geometry concepts to statistical modeling.
Findings
Adding the overall effect creates the smallest regular exponential family containing the curved one.
Removing the overall effect simplifies the family but can lead to different estimation properties.
Including the overall effect can produce estimates outside the intended model in biological applications.
Anna Klimova
National Center for Tumor Diseases (NCT), Partner Site Dresden, and
Institute for Medical Informatics and Biometry,
Technical University, Dresden, Germany
Tamás Rudas
Center for Social Sciences, Hungarian Academy of Sciences, and
Department of Statistics, Eötvös Loránd University, Budapest, Hungary
Abstract
Exponential families of discrete probability distributions when the normalizing constant (or overall effect) is added or removed are compared in this paper. The latter setup, in which the exponential family is curved, is particularly relevant when the sample space is an incomplete Cartesian product or when it is very large, so that the computational burden is significant. The lack or presence of the overall effect has a fundamental impact on the properties of the exponential family. When the overall effect is added, the family becomes the smallest regular exponential family containing the curved one. The procedure is related to the homogenization of an inhomogeneous variety discussed in algebraic geometry, of which a statistical interpretation is given as an augmentation of the sample space. The changes in the kernel basis representation when the overall effect is included or removed are derived. The geometry of maximum likelihood estimates, also allowing zero observed frequencies, is described with and without the overall effect, and various algorithms are compared. The importance of the results is illustrated by an example from cell biology, showing that routinely including the overall effect leads to estimates which are not in the model intended by the researchers.
Keywords: algebraic variety, contingency table, independence, log-linear model, maximum likelihood estimation, overall effect, relational model
1 Introduction
This paper deals with exponential families of probability distributions over discrete sample spaces. When defining such families, usually, a normalizing constant, which of course, is constant over the sample space but not over the family, is included. The presence of the normalizing constant implies that the parameter space may be an open set, which, in turn, is necessary for asymptotic normality of estimates and for the applicability of standard testing procedures. The normalizing constant, from an applied perspective, may be interpreted as a baseline or common effect, present everywhere on the sample space and is, therefore, also called the overall effect. The focus of the present work is to better understand the implications of having or not having an overall effect in such families, in particular how adding or removing the overall effect affects the properties of discrete exponential families.
Motivated by a number of important applications, Klimova et al. (2012) and Klimova and Rudas (2012, 2016) developed the theory of relational models, which generalize discrete exponential families, also called log-linear models, to situations when the sample space is not necessarily a full Cartesian product, the statistics defining the exponential family are not necessarily indicators of cylinder sets, and the overall effect is not necessarily present. Exponential families without the overall effect are particularly relevant, sometimes necessary, when the sample space is a proper subset of a Cartesian product. Several real examples in which certain combinations of the characteristics were either logically impossible or left out of the design of the experiment were discussed in Klimova et al. (2012). A real problem of this structure from cell biology is analyzed in this paper, too. When the overall effect is not present, the standard normalization procedure to obtain probability distributions cannot be applied, because the family is curved (Klimova et al., 2012). When, in spite of this, the standard normalization procedure is applied, as was done in the published analysis of that problem, the resulting estimates do not possess the fundamental model properties.
The standardization of the estimates in exponential families is also an issue when the size of the problem is very large and the computational burden is significant. Some neural probabilistic language models are relational models. Due to the high-dimensional sample space, the evaluation of the partition function, which is needed for normalization, may be intractable. Some of the methods of parameter estimation under such models are based on the removal of the partition function, that is, the removal of the overall effect from the model, and on training the models without the overall effect. Approximations of estimates with and without the overall effect were studied, for example, by Mnih and Teh (2012) and Andreas and Klein (2015), among others. A different approach to avoiding global normalization (i.e., having an overall effect) is described in Koller and Friedman (2009). However, the implications of the removal of the overall effect are not discussed in the existing literature.
Another area where removing or including the overall effect is relevant is context specific independence models, see, e.g., Høsgaard (2004) and Nyman, Pensar, Koski, and Corander (2016). When the sample space is an incomplete Cartesian product, removing the overall effect, as described in this paper, specifies different variants of conditional independence in the parts of the sample space, depending on whether or not the part is affected by the missing cells.
While including the overall effect in the definition of the statistical model to be investigated is seen by many researchers as “natural” or “harmless”, we show in this paper that adding or removing the overall effect may dramatically change the characteristics of the exponential family, up to the point of altering the fundamental model property intended by the researcher.
The main results of the paper include showing that allowing the overall effect expands the curved exponential family to the smallest regular exponential family which contains it. When the overall effect is removed, the sample space may have to be reduced (if there were cells which contained the overall effect only), and the changes in the structure of the generalized odds ratios defining the model are described in both cases. In the language of algebraic geometry, the procedure of removing the overall effect is identical to the dehomogenization of the variety defining the model (Cox, Little, and O’Shea, 2007). An important area of applications of the results presented here is the case when several binary features are observed, but the combination that no feature is present is either logically impossible or is possible but was left out of the study design. The converse of dehomogenization, that is, homogenizing a variety, involves including a new variable, and it is shown that in some cases this can be identified, from a statistical perspective, with augmenting the sample space by a cell which is characterized by no feature being present. For example, the Aitchison–Silvey independence (Aitchison and Silvey, 1960; Klimova and Rudas, 2015) is homogenized, through the augmentation of the sample space, into the standard independence model.
The paper is organized as follows. Section 2 gives a canonical definition of relational models using homogeneous and, if no overall effect is included, one inhomogeneous generalized odds ratio. This definition is called the dual representation, and it is shown that including the overall effect is identical to omitting the inhomogeneous generalized odds ratio from it.
Section 3 contains the result that including the overall effect expands the curved exponential family into the smallest regular one containing it. For the case of the removal of the overall effect, the dual representation of the model is given, and the relevance of certain results in algebraic geometry to the statistical problem is discussed. In particular, the homogenization of a variety through including a new variable is identified with augmenting the sample space with a cell where no feature is present, when this is meaningful. It is proved that the homogenization of the Aitchison – Silvey (in the sequel, AS) independence model, which is defined on sample spaces where all combinations of features, except for the “no feature present” combination, are possible, is the usual model of mutual independence on the full Cartesian product obtained after augmenting the sample space with the missing cell. The relationship of these results with context specific independence is also described.
Section 4 compares the maximum likelihood (ML) estimates in geometrical terms for relational models with and without the overall effect and, based on the insight obtained, gives a modification of the algorithm proposed in Klimova and Rudas (2015). It is illustrated that the ML estimates under two models which differ only in the lack or presence of the overall effect may be very different, up to the point of the existence or non-existence of positive ML estimates when the data contain observed zeros. However, when the MLE exists in the model containing the overall effect, it also exists in the model obtained after the removal of the overall effect.
Finally, Section 5 discusses an example of applications of relational models in cell biology. The equal loss of potential model in hematopoiesis (Perié et al., 2014) is a relational model without the overall effect. The published analysis of this model added the overall effect to it, to simplify calculations, and with this changed the properties of the model so that the published estimates do not fulfill the fundamental model property.
2 A canonical form of relational models
Let be random variables taking values in finite sets , respectively. Let the sample space be a non-empty, proper or improper, subset of , written as a sequence of length in the lexicographic order. Assume that the population distribution is parameterized by cell probabilities , where and , and denote by $\mathcal{P}$ the set of all strictly positive distributions on . For simplicity of exposition, a distribution in $\mathcal{P}$ will be identified with its parameter, $\boldsymbol{p}$, and $\mathcal{P}=\{\boldsymbol{p}>\boldsymbol{0}:\ \boldsymbol{1}^{\prime}\boldsymbol{p}=1\}$.
Let be a matrix of full row rank with 0–1 elements and no zero columns. A relational model for probabilities generated by is the subset of that satisfies:
[TABLE]
where = is the vector of log-linear parameters of the model. A dual representation of a relational model can be obtained using a matrix, , whose rows form a basis of , and thus, = :
[TABLE]
The number of the degrees of freedom of the model is equal to . In the sequel, $\boldsymbol{d}_{1}^{\prime},\boldsymbol{d}_{2}^{\prime},\dots,\boldsymbol{d}_{K}^{\prime}$ denote the rows of . The dual representation can also be expressed in terms of the generalized odds ratios:
[TABLE]
or in terms of the cross-product differences:
[TABLE]
where and stand for, respectively, the positive and negative parts of a vector . The following dual representation is invariant of the choice of the kernel basis.
Let denote the polynomial variety associated with (Sturmfels, 1996):
[TABLE]
The relational model generated by is the following set of distributions:
[TABLE]
where is the interior of the -dimensional simplex.
Notice that the variety includes elements with zero components as well and can be used to extend the definition of the model to allow zero probabilities. The extended relational model, $\overline{RM}(\mathbf{A})$, is the intersection of the variety with the probability simplex:
[TABLE]
See Klimova and Rudas (2016) for more detail on the extended relational models.
Let $\boldsymbol{1}^{\prime} = (1,\dots,1)$ be the row of 1’s of length . If $\boldsymbol{1}^{\prime}$ does not belong to the space spanned by the rows of $\mathbf{A}$, the relational model is said to be a model without the overall effect. Such models are specified using homogeneous and at least one non-homogeneous generalized odds ratio, and the corresponding variety is non-homogeneous (Klimova et al., 2012).
Proposition 1.
Let be a model without the overall effect. There exists a kernel basis matrix whose rows satisfy:
[TABLE]
Proof.
A relational model does not contain the overall effect if and only if it can be written using non-homogeneous (and possibly homogeneous) generalized odds ratios (Klimova et al., 2012). Therefore, the kernel basis matrix has at least one row, say $\boldsymbol{d}_{1}^{\prime}$, that is not orthogonal to $\boldsymbol{1}$: $C_{1}=\boldsymbol{d}_{1}^{\prime}\boldsymbol{1}\neq 0$.
Suppose there exists another row, say , that is not orthogonal to and thus . The vectors and are linearly independent, so are the vectors and . Substitute the row with the row . And so on. ∎
It is assumed in the sequel that $\boldsymbol{1}^{\prime}$ is not in the row space of $\mathbf{A}$. Notice that, because $\mathbf{A}$ is a 0–1 matrix without zero columns, this is only possible when . Throughout the entire paper, the kernel basis matrix is assumed to satisfy (8), and, without loss of generality, $\boldsymbol{d}_{1}^{\prime}\boldsymbol{1}=-1$.
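The reduction used in the proof of Proposition 1, followed by the normalization of the first row, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `normalize_kernel_basis` is hypothetical, the row replacement $C_1\boldsymbol{d}_k - C_k\boldsymbol{d}_1$ is one possible choice of a linear combination with zero sum, and the input basis is made up for the example. Exact rational arithmetic is used so that the row sums can be checked exactly.

```python
from fractions import Fraction


def normalize_kernel_basis(rows):
    """Return an equivalent basis in which only the first row has a
    non-zero sum, and that sum equals -1 (hypothetical helper).

    Assumes at least one input row has a non-zero sum, i.e. the model
    has no overall effect.
    """
    rows = [[Fraction(x) for x in r] for r in rows]
    # Move a row with a non-zero sum to the front; it plays the role of d_1.
    pivot = next(i for i, r in enumerate(rows) if sum(r) != 0)
    rows[0], rows[pivot] = rows[pivot], rows[0]
    c1 = sum(rows[0])
    for k in range(1, len(rows)):
        ck = sum(rows[k])
        if ck != 0:
            # Replace d_k by C_1 * d_k - C_k * d_1; its entries sum to zero.
            rows[k] = [c1 * x - ck * y for x, y in zip(rows[k], rows[0])]
    # Rescale d_1 so that its entries sum to -1.
    rows[0] = [-x / c1 for x in rows[0]]
    return rows


# Illustrative basis: both rows have sum -1 before the reduction.
basis = normalize_kernel_basis([[-1, 0, 1, -1], [-2, 1, 1, -1]])
row_sums = [sum(r) for r in basis]
```

Because $C_1 \neq 0$, each replacement keeps the rows linearly independent, so the result is still a basis of the same kernel.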
Some consequences of adding the overall effect to a relational model will be investigated by comparing the properties of the relational model generated by and the model generated by the matrix obtained by augmenting the model matrix with the row :
[TABLE]
Let be the relational model generated by . Because is a row of , the corresponding polynomial variety is homogeneous (cf. Sturmfels, 1996).
Theorem 1.
The dual representation of can be obtained from the dual representation of by removing the constraint specified by a non-homogeneous odds ratio from the latter.
Proof.
Write the dual representation of in terms of the generalized log odds ratios:
[TABLE]
By the previous assumption, $\boldsymbol{d}_{1}^{\prime}\boldsymbol{1}=-1$, and thus, the constraint $\boldsymbol{d}_{1}^{\prime}\log\boldsymbol{p}=0$ is specified by a non-homogeneous odds ratio. Define as:
[TABLE]
Because, from (8), $\boldsymbol{d}_{2},\dots,\boldsymbol{d}_{K}\in Ker(\mathbf{A})$,
[TABLE]
and thus, $\boldsymbol{d}_{2},\dots,\boldsymbol{d}_{K}\in Ker(\bar{\mathbf{A}})$. Finally, as , $\boldsymbol{d}_{2},\dots,\boldsymbol{d}_{K}$ is a basis of $Ker(\bar{\mathbf{A}})$, and therefore,
[TABLE]
∎
3 The influence of the overall effect on the model structure
The consequences of adding or removing the overall effect will be studied separately. The changes in the model structure after the overall effect is added are considered first.
Let be a relational model without the overall effect and be the corresponding augmented model. Let for , . For any :
[TABLE]
where , , are the log-linear parameters of . In particular, is the overall effect of .
Theorem 2.
The augmented model, , is the minimal regular exponential family which contains , and
[TABLE]
Proof.
The second claim is proved first. Denote . Let be a kernel basis matrix of , having the form (8), and notice that
[TABLE]
Therefore, any , satisfies , and thus, belongs to . On the other hand, for any , both and must hold, which immediately implies that .
The first claim is proved next.
The relational model is a curved exponential family parameterized by
[TABLE]
If the overall effect is added to , the parameter space gets an additional parameter:
[TABLE]
Because is the smallest open set in that contains , it parameterizes the minimal regular exponential family containing . This family is, in fact, . ∎
Example 1.
The relational models generated by the matrices
[TABLE]
consist of positive probability distributions which can be written in the following forms:
[TABLE]
where is the overall effect. The dual representations can be written in the log-linear form, using $\boldsymbol{d}_{1}=(-1,0,1,-1)^{\prime},\ \boldsymbol{d}_{2}=(-1,1,0,0)^{\prime}\in Ker(\mathbf{A})$:
[TABLE]
By Theorem 1, after the overall effect is added, the model specification does not include the non-homogeneous constraint anymore. In terms of the generalized odds ratios:
[TABLE]
The second model may be defined using restrictions only on homogeneous odds ratios, and there is no need to place an explicit restriction on the non-homogeneous odds ratio.
∎
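The effect of adding the overall effect on the two kernel vectors of Example 1 can be checked numerically. The model matrices of the example are not reproduced in the text above, so the sketch below assumes the matrix `A` with rows (1, 1, 1, 0) and (0, 0, 1, 1), which is a 0–1 matrix of full row rank whose kernel contains the quoted vectors d1 and d2. The parameter values and the helper names `matvec` and `odds_ratio` are hypothetical and chosen only for illustration.

```python
A = [[1, 1, 1, 0],
     [0, 0, 1, 1]]          # assumed model matrix (see the lead-in)
A_bar = [[1, 1, 1, 1]] + A  # augmented with the row of ones

d1 = [-1, 0, 1, -1]         # non-homogeneous: entries sum to -1
d2 = [-1, 1, 0, 0]          # homogeneous: entries sum to 0


def matvec(M, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def odds_ratio(d, p):
    """Generalized odds ratio: the product of p_i ** d_i."""
    value = 1.0
    for di, pi in zip(d, p):
        value *= pi ** di
    return value


# Both vectors are in Ker(A); only d2 stays in the kernel after augmentation.
in_kernel_A = [matvec(A, d) for d in (d1, d2)]
in_kernel_A_bar = [matvec(A_bar, d) for d in (d1, d2)]

# Without the overall effect: p = (a1, a1, a1*a2, a2) with a1 = 0.2, a2 = 0.5.
p = [0.2, 0.2, 0.1, 0.5]
# With the overall effect b0 = 0.25 and all other multiplicative parameters 1.
p_bar = [0.25, 0.25, 0.25, 0.25]

ratios_p = (odds_ratio(d1, p), odds_ratio(d2, p))
ratios_p_bar = (odds_ratio(d1, p_bar), odds_ratio(d2, p_bar))
```

In this sketch the homogeneous odds ratio equals 1 for both distributions, while the non-homogeneous one equals 1 only for `p`; for `p_bar` it equals the reciprocal of the assumed overall effect.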
The changes in the model structure after the overall effect is removed are examined next.
A relational model with the overall effect can be reparameterized so that its model matrix has a row of 1’s, and because of full row rank, this vector is not spanned by the other rows. The implications of the removal of the overall effect will be investigated using a model matrix of this structure, say . By the removal of the row , one may obtain a different model matrix on the same sample space, but it may happen that there exists a cell , whose only parameter is the overall effect, and after its removal, the -th column contains zeros only. In such cases, to have a proper model matrix, such columns, that is, such cells, need to be removed. Write for the set of all such cells , and let . Then, the reduced model matrix, , is obtained from after removing the row of 1’s and deleting the columns which, after this, contain only zeros. This is a model matrix on . Without loss of generality, the matrix can be written as:
[TABLE]
If the sample spaces of and are the same, that is, when is empty, the reduced model is the subset of the original one, consisting of the distributions whose overall effect is zero, see Theorem 2. If the sample space is reduced, the relationship between the kernel basis matrices is described in the next result.
Theorem 3.
The following holds:
(i) .
(ii) The kernel basis matrix of may be obtained from the kernel basis matrix of by deleting the columns in and then leaving out the redundant rows.
Proof.
(i) Because is a matrix of full row rank, . The linear independence of its rows implies that the rows of are also linearly independent. Therefore, because is a matrix, , which implies the result.
(ii) Let be a kernel basis of . Write
[TABLE]
Then,
[TABLE]
which implies that
[TABLE]
Suppose does not have the overall effect. Notice that each has length , and therefore, one can apply a non-singular linear transformation to the basis vectors to reduce them to the form:
[TABLE]
The equations (11) imply that
[TABLE]
[TABLE]
The linear independence of in entails the linear independence of in . Notice that are jointly linearly independent from , but not necessarily linearly independent from each other. A kernel basis of comprises linearly independent vectors in , and, for example, form such a basis. Therefore, can be derived from a kernel basis matrix of by removing the columns for and leaving out the redundant rows.
Suppose has the overall effect; without loss of generality, is a row of . In this case, (11) implies that both and , for . Because ’s and ’s vary independently from each other, the linear independence of will imply that , for , are also linearly independent in . Consequently, any vectors among ’s are linearly independent in and can form a kernel basis of . Thus, as in the previous case, can be derived from a kernel basis matrix of by removing the columns for and leaving out the redundant rows.
∎
The next two examples illustrate the theorem.
Example 2.
Let be the relational model generated by
[TABLE]
Here, . In terms of the generalized odds ratios the model can be written as:
[TABLE]
Remove the row and the last two columns and consider the reduced matrix:
[TABLE]
The model does not have the overall effect and can be specified by two generalized odds ratios:
[TABLE]
These odds ratios are defined on the smaller probability space, and may be obtained by removing and , and the redundant odds ratio, from the odds ratio specification of the original model. ∎
Example 3.
Consider the relational model generated by
[TABLE]
In terms of the generalized odds ratios, the model specification is . Notice that is row equivalent to
[TABLE]
Because every in is orthogonal to , its last component has to be zero: . Therefore, in any specification of in terms of the generalized odds ratios, will not be present. Set
[TABLE]
The model has the overall effect and can be specified by exactly the same generalized odds ratio as the model : .
As a further illustration, take
[TABLE]
In this case,
[TABLE]
and is the same as above. In this case, is specified by , and is described as previously: . ∎
The polynomial variety defining the model is homogeneous. If the removal of the cells comprising leads to a model without the overall effect, the variety is dehomogenized, yielding the affine variety (cf. Cox et al., 2007).
The converse to this procedure, homogenization of an affine variety, is also studied in algebraic geometry, and is performed by introducing a new variable in such a way that all equations defining the variety become homogeneous. The essence of this procedure is that all probabilities are multiplied by this new variable. This leaves the homogeneous odds ratios unchanged, as the new variable cancels out. The value of a non-homogeneous odds ratio becomes, instead of , the reciprocal of the new variable. For example, the odds ratio in Example 2 becomes , where is the new variable. If now could be seen as the probability of an additional cell, say , then this would be a homogeneous odds ratio, .
Although a straightforward procedure in algebraic geometry, it does not necessarily have a clear interpretation in statistical inference. Introducing a new variable and a new cell for the purpose of homogenization can be made meaningful in some situations, if the sample space may be extended by one cell, and the new variable is the parameter (probability) of this cell. Homogenization requires this new variable to appear in every cell, too, so the parameter may be seen as the overall effect. The new cell has only the overall effect, thus no feature is present in this cell.
The augmentation of the sample space by an additional cell does make sense, if that cell exists in the population but was not observed because of the design of the data collection procedure, as in Example 4. The additional cell has the overall effect only, thus is a “no feature present” cell.
Example 4.
In the study of swimming crabs by Kawamura, Matsuoka, Tajiri, Nishida, and Hayashi (1995), three types of baits were used in traps to catch crabs: fish alone, sugarcane alone, and a fish-sugarcane combination. The sample space consists of three cells, , and the cell is absent by design, because there were no traps without any bait. Under the AS independence, the cell parameter associated with both bait types present is the product of the parameters associated with the other two cells. This is a relational model without the overall effect, generated by the matrix (cf. Klimova and Rudas, 2015):
[TABLE]
In fact, the overall effect cannot be included in this situation, because it would saturate the model. The affine variety associated with this model can be homogenized by including a new variable. The new variable may only be interpreted as the parameter associated with no bait present, and calls for an additional cell in the sample space (to avoid the saturation of the model), which may only be interpreted as setting up a trap without any bait, which would also be a plausible research design. The resulting model is generated by :
[TABLE]
and indeed, is the model of traditional independence on the complete contingency table. ∎
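The augmentation in Example 4 can be traced with exact arithmetic. The sketch below is an illustration only: the parameter values `a` and `b` are hypothetical and were chosen so that the three cell probabilities sum to one, `t` denotes the new variable introduced by homogenization, and the cell labels are (fish, sugarcane) indicators.

```python
from fractions import Fraction

# AS independence on the three observed cells (hypothetical parameter values):
# fish only, sugarcane only, fish and sugarcane.
a, b = Fraction(1, 2), Fraction(1, 3)
p_fish, p_sugar, p_both = a, b, a * b
as_total = p_fish + p_sugar + p_both  # equals 1 for this choice of a and b

# Homogenize: multiply every cell by the new variable t and add a
# "no bait" cell whose only parameter is t.
t = 1 / ((1 + a) * (1 + b))
table = {
    (0, 0): t,           # no bait: overall effect only
    (1, 0): t * a,       # fish only
    (0, 1): t * b,       # sugarcane only
    (1, 1): t * a * b,   # both baits
}
total = sum(table.values())

# The conventional odds ratio of the augmented 2x2 table.
odds = table[(0, 0)] * table[(1, 1)] / (table[(1, 0)] * table[(0, 1)])

# Conditioning on "at least one bait present" gives back the AS distribution.
conditional = [table[c] / (1 - t) for c in [(1, 0), (0, 1), (1, 1)]]
```

The odds ratio of the augmented table equals 1, which is the usual independence constraint, and conditioning on the three original cells recovers the starting distribution. The probability `t` of the added cell is a model quantity here, not an observed frequency.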
For situations like that in Example 4, the AS independence is a natural model, but it also applies to cases in which the “no feature present” situation is logically impossible (such as market basket analysis or records of traffic violations, see Klimova et al. (2012) and Klimova and Rudas (2015), and also the biological example in Section 5), and in such cases the cell augmentation procedure is not meaningful. There are, however, situations in which the existence of the “no feature present” cell is not logically impossible, but its actual existence in the population is dubious.
For a more general discussion of the homogenization of AS independence, let be a kernel basis of , satisfying (8) with . The polynomial ideal associated with the matrix is generated by one non-homogeneous polynomial , and homogeneous polynomials, . Notice that, because = , the difference in the degrees of the monomials and is . Therefore, the polynomial can be homogenized by multiplying the first monomial by one additional variable, say :
[TABLE]
The polynomial ideal generated by
[TABLE]
and the corresponding variety are homogeneous, and can be described by the matrix of size of the following structure:
[TABLE]
Here, is the row of 1’s of length , and is the column of zeros of length .
In fact, the homogeneous variety is the projective closure of the affine variety (Cox et al., 2007). The latter can be obtained from the former by dehomogenization via setting .
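The homogenization of a single binomial can be written out on exponent vectors. The sketch below is a minimal illustration, assuming a kernel vector `d` whose entries sum to a non-positive number; the helper names `split_parts` and `homogenize` are hypothetical, and the new variable is placed in the first coordinate of the extended exponent vectors purely by convention.

```python
def split_parts(d):
    """Positive and negative parts of an integer vector d, so d = pos - neg."""
    pos = [max(x, 0) for x in d]
    neg = [max(-x, 0) for x in d]
    return pos, neg


def homogenize(d):
    """Exponent vectors of the two monomials after homogenization.

    The first coordinate is the exponent of the new variable, which
    multiplies the monomial built from the positive part of d.
    """
    pos, neg = split_parts(d)
    gap = sum(neg) - sum(pos)  # degree difference, equal to -sum(d)
    if gap < 0:
        raise ValueError("this sketch assumes sum(d) <= 0")
    return [gap] + pos, [0] + neg


# Kernel vector with sum -1, as quoted in Example 1.
d1 = [-1, 0, 1, -1]
first, second = homogenize(d1)
# first  encodes z * p3, second encodes p1 * p4: both have degree 2.
degrees = (sum(first), sum(second))
# The difference of the exponent vectors is an extended vector with zero sum.
extended = [x - y for x, y in zip(first, second)]
```

For a vector with sum -1 the degree gap is exactly one, so a single factor of the new variable suffices, and the extended vector sums to zero, as required of a homogeneous odds ratio.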
The homogenization of the model of AS independence for three features is discussed next.
Example 5.
Consider the model of AS independence for three attributes, , , and , described in Klimova et al. (2012).
[TABLE]
Here for , where the combination does not exist, and . The equations (12) specify the relational model generated by
[TABLE]
Consider the following kernel basis matrix which is of the form (8):
[TABLE]
The corresponding polynomial ideal is:
[TABLE]
The generating set of includes at least one non-homogeneous polynomial, due to , and can be homogenized by introducing a new variable, say . The resulting ideal,
[TABLE]
is homogeneous, and its zero set
[TABLE]
where
[TABLE]
is thus a homogeneous variety. The relational model is defined on a larger sample space, namely . The model has the overall effect and is the following set of distributions:
[TABLE]
The rows of are the indicators of the cylinder sets of the total (the row of 1’s), and of the , , and marginals. Therefore, the relational model is the traditional model of mutual independence. ∎
The next theorem states in general what was seen in the example. Let be the random variables taking values in . Write for the Cartesian product of their ranges, and let .
Theorem 4.
Let be the model of AS independence of on the sample space . The interior of the projective closure of this model is the log-linear model of mutual independence of on the sample space .
Proof.
Let be the model matrix for the AS independence:
[TABLE]
The number of columns of is equal to the number of cells in the sample space , . The model is the intersection of the polynomial variety and the interior of the simplex . The variety is non-homogeneous, because among its generators there is at least one non-homogeneous polynomial. In order to obtain the projective closure of (cf. Cox et al., 2007), add the “no feature present” cell, indexed by [math], to the sample space, choose a Gröbner basis of the ideal , and homogenize all non-homogeneous polynomials in this basis using the cell probability . Because the projective closure of is the minimal homogeneous variety in the projective space whose dehomogenization is (Cox et al., 2007), Theorem 3(ii) implies that this closure can be described using the matrix
[TABLE]
Each distribution in has the multiplicative structure prescribed by (Klimova and Rudas, 2016), and during the homogenization, is mapped into a positive distribution in . Because all strictly positive distributions in have the multiplicative structure prescribed by , they comprise the relational model . This matrix describes the model of mutual independence between in the effect coding, and the proof is complete. ∎
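The statement of Theorem 4 can be checked numerically for three binary features. The sketch below rests on assumptions that are not spelled out in the text above: the AS model matrix is taken to consist of one indicator row per feature (equal to 1 in the cells where that feature is present), the augmented matrix adds the "no feature present" cell and a row of ones, and the values in `theta` are hypothetical multiplicative parameters.

```python
from fractions import Fraction
from itertools import product

K = 3
theta = [Fraction(1, 2), Fraction(2), Fraction(3, 4)]  # hypothetical parameters

full_cells = list(product([0, 1], repeat=K))   # includes the (0, 0, 0) cell
as_cells = [c for c in full_cells if any(c)]   # AS sample space

# Assumed model matrices: one indicator row per feature, plus a row of ones
# after augmentation.
A = [[c[k] for c in as_cells] for k in range(K)]
A_bar = [[1] * len(full_cells)] + [[c[k] for c in full_cells] for k in range(K)]


def weight(cell):
    """Product of the parameters of the features present in the cell."""
    w = Fraction(1)
    for k in range(K):
        if cell[k]:
            w *= theta[k]
    return w


# Distribution with the multiplicative structure prescribed by A_bar;
# 1 / norm plays the role of the overall effect.
norm = sum(weight(c) for c in full_cells)
p = {c: weight(c) / norm for c in full_cells}

marginal = [sum(p[c] for c in full_cells if c[k] == 1) for k in range(K)]


def product_of_marginals(cell):
    value = Fraction(1)
    for k in range(K):
        value *= marginal[k] if cell[k] else 1 - marginal[k]
    return value


# Mutual independence: every cell probability is the product of its marginals.
independent = all(p[c] == product_of_marginals(c) for c in full_cells)

# Dropping the row of ones and the added cell recovers the AS model matrix.
dehomogenized = [row[1:] for row in A_bar[1:]]
```

The first column of `A_bar` is (1, 0, 0, 0), so the added cell carries the overall effect only, and removing that cell together with the row of ones gives back `A`.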
The homogenization (in the language of algebraic geometry) or regularization (in the language of the exponential families) leads to a simpler structure, which allows a simpler calculation of the MLE. However, the additional cell was not observed in these cases, and assuming that its frequency is zero is unfounded and may lead to wrong inference.
The framework developed here may also be used to define context specific independence, so that in one context conditional independence holds, in another one, AS independence does. To illustrate, let , , be random variables taking values in . Assume that the outcome is impossible, so the sample space can be expressed as:
[TABLE]
Let , and consider the relational model without the overall effect generated by
[TABLE]
The kernel basis matrix is equal to:
[TABLE]
and thus, the model can be specified in terms of the following two generalized odds ratios:
[TABLE]
The second constraint expresses the (conventional) context-specific independence of and given . The first odds ratio is non-homogeneous, and the corresponding constraint may be seen as the context-specific AS-independence of and given .
4 ML estimation with and without the overall effect
The properties of the ML estimates under relational models, discussed in detail in Klimova et al. (2012) and Klimova and Rudas (2016), are summarized here in the language of the linear and multiplicative families defined by the model matrix and its kernel basis matrix. The conditions of existence of the MLE are reviewed first.
Let denote the columns of , and let be the polyhedral cone whose relative interior comprises such , for which there exists a that satisfies . A set of indices is called facial if the columns are affinely independent and span a proper face of (cf. Grünbaum, 2003; Geiger, Meek, and Sturmfels, 2006; Fienberg and Rinaldo, 2012). It can be shown that a set is facial if and only if there exists a , such that for every and for every .
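A certificate for a facial set can be verified mechanically. The inline conditions are not legible in the text above, so the sketch below assumes the usual form of the criterion: a vector c certifies the index set F if the inner product of c with a column is zero for indices in F and strictly positive otherwise. The helper name `certifies_face` is hypothetical, and the columns are those of an assumed model matrix with rows (1, 0, 1) and (0, 1, 1).

```python
def certifies_face(columns, face, c):
    """True if c'a_i = 0 for i in face and c'a_i > 0 for i outside it."""
    for i, col in enumerate(columns):
        value = sum(ci * ai for ci, ai in zip(c, col))
        if i in face and value != 0:
            return False
        if i not in face and value <= 0:
            return False
    return True


# Columns of the assumed 2 x 3 model matrix (see the lead-in).
columns = [(1, 0), (0, 1), (1, 1)]

first_ok = certifies_face(columns, {0}, (0, 1))    # c = (0, 1) works for {0}
second_ok = certifies_face(columns, {1}, (1, 0))   # c = (1, 0) works for {1}
# This particular c is not a certificate for {2}; the check says nothing
# about other candidate vectors.
third_ok = certifies_face(columns, {2}, (1, -1))
```

The function only checks a given candidate vector. Deciding whether any certificate exists for a set is a linear feasibility problem and is not attempted here.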
Let and let be the set of , such that, for a fixed , the linear family
[TABLE]
is not empty, and let \mathcal{F}(\mathbf{A},\mbox{\boldmathq})=\bigcup_{\mathcal{K}}\mathcal{F}(\mathbf{A},\mbox{\boldmathq},\kappa). For each , the linear family \mathcal{F}(\mathbf{A},\mbox{\boldmathq},\kappa) is a polyhedron in the cone .
Theorem 5.
(Klimova and Rudas, 2016) Let be a relational model, with or without the overall effect, and let be the observed distribution.
1. The MLE given exists if only:
(i) , or
(ii) and, for all facial sets of , .
In either case, $\hat{\boldsymbol{p}}_{q}=\mathcal{F}(\mathbf{A},\boldsymbol{q})\cap \mathrm{int}(\mathcal{X}_{\mathbf{A}})$, and there exists a unique constant , also depending on , such that:
[TABLE]
2. The MLE under the extended model $\overline{RM}(\mathbf{A})$, defined in (7), always exists and is the unique point of which satisfies:
[TABLE]
The statements follow from Theorem 4.1 in Klimova and Rudas (2016) and Corollary 4.2 in Klimova et al. (2012), and the proof is thus omitted. The constant , called the adjustment factor, is the ratio between the subset sums of the MLE, $\mathbf{A}\hat{\boldsymbol{p}}_{q}$, and the subset sums of the observed distribution, $\mathbf{A}\boldsymbol{q}$. If the overall effect is present in the model, for all .
Let be a model matrix whose row space does not contain , and let be the matrix obtained by augmenting with the row . It will be shown in the proof of the next theorem that every facial set of is facial for . If the observed is positive, the MLEs $\hat{\boldsymbol{p}}_{q}$ and $\bar{\boldsymbol{p}}_{q}$ under the models and , respectively, exist. However, as implied by the relationship between the facial sets of and , if has some zeros, the MLE may exist under , but not under , or neither of the MLEs may exist.
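The behaviour of the adjustment factor when the overall effect is present can be checked in one line. The sketch below uses notation that is assumed here rather than taken from the text: $\gamma_{q}$ for the adjustment factor, $\boldsymbol{1}$ for the vector of ones, and $\boldsymbol{c}$ for a coefficient vector with $\boldsymbol{c}^{\prime}\mathbf{A}=\boldsymbol{1}^{\prime}$, which exists exactly when the row space of the model matrix contains the row of ones. It relies only on the proportionality of subset sums, $\mathbf{A}\hat{\boldsymbol{p}}_{q}=\gamma_{q}\mathbf{A}\boldsymbol{q}$, and on the fact that both the MLE and the observed distribution sum to one:

```latex
1 \;=\; \boldsymbol{1}^{\prime}\hat{\boldsymbol{p}}_{q}
  \;=\; \boldsymbol{c}^{\prime}\mathbf{A}\hat{\boldsymbol{p}}_{q}
  \;=\; \gamma_{q}\,\boldsymbol{c}^{\prime}\mathbf{A}\boldsymbol{q}
  \;=\; \gamma_{q}\,\boldsymbol{1}^{\prime}\boldsymbol{q}
  \;=\; \gamma_{q}.
```

Under these assumptions the adjustment factor equals one, so the subset sums of the observed distribution are preserved by the MLE.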
Theorem 6.
Let be a model matrix whose row space does not contain , and let be the matrix obtained by augmenting with the row . Let be the observed distribution. If, given , the MLE under exists, so does the MLE under .
Proof.
If , both MLEs exist.
Assume that has some zeros, that is, , and that the MLE under exists. It will be shown next that for any facial set of , .
The proof is by contradiction. Let be a facial set of , such that . Therefore, there exists a , such that for every and for every .
Denote by the columns of . By construction, , . Let . Then,
[TABLE]
and thus, is a facial set of . Because , the MLE under , given , does not exist, which contradicts the initial assumption. This completes the proof. ∎
Example 5 (revisited): Let be the observed distribution. Because is not a subset of any facial set of , the MLE exists:
[TABLE]
with .
On the other hand, the set of indices is facial for , and . In this case, the MLE exists only in the extended model $\overline{RM}(\bar{\mathbf{A}})$, and is equal to itself.
Let . Because is a subset of a facial set of and of a facial set of , the MLEs exist only in the corresponding extended models. ∎
Further properties of the adjustment factor, including its geometrical meaning, are described next, relying on the following result:
Theorem 7.
Let be a model matrix whose row space does not contain , and let be the matrix obtained by augmenting with the row . For any $\boldsymbol{r}_{1},\boldsymbol{r}_{2}\in\mathcal{P}$, $\boldsymbol{r}_{1}\neq\boldsymbol{r}_{2}$, the following holds:
(i) The MLEs under , given they exist, are equal if and only if the subset sums entailed by are proportional:
[TABLE]
and the adjustment factors in the MLE satisfy: = .
(ii) The MLEs under , given they exist, are equal if and only if the subset sums entailed by coincide:
[TABLE]
The statements are a reformulation of Theorem 4.4 in Klimova et al. (2012), and no proofs are provided here. The relationship between the adjustment factors is obvious.
The theorem implies that $\mathcal{F}(\mathbf{A},\boldsymbol{q})$ is an equivalence class in , in the sense that, for any $\boldsymbol{r}\in\mathcal{F}(\mathbf{A},\boldsymbol{q})$, the MLE under satisfies $\hat{\boldsymbol{p}}_{r}=\hat{\boldsymbol{p}}_{q}$. Each sub-family $\mathcal{F}(\mathbf{A},\boldsymbol{q},\kappa)$ is characterized by its unique adjustment factor under . That is, for every , ,
[TABLE]
In addition, for any , and therefore, for a fixed , $\mathcal{F}(\mathbf{A},\boldsymbol{q},\kappa)$ is an equivalence class under .
From a geometrical point of view, $\mathcal{F}(\mathbf{A},\boldsymbol{q})$ is a polyhedron which decomposes into polyhedra $\mathcal{F}(\mathbf{A},\boldsymbol{q},\kappa)$, with ; clearly, $\boldsymbol{q}\in\mathcal{F}(\mathbf{A},\boldsymbol{q},1)$. The MLE under given $\boldsymbol{r}\in\mathcal{F}(\mathbf{A},\boldsymbol{q},\kappa)$ is the unique point common to the polyhedron $\mathcal{F}(\mathbf{A},\boldsymbol{q},\kappa)$ and the variety . Among the feasible values of there exists a unique one, say , such that the MLE $\bar{\boldsymbol{p}}_{r}$, for all $\boldsymbol{r}\in\mathcal{F}(\mathbf{A},\boldsymbol{q},\hat{\kappa})$, coincides with the MLE of under , $\hat{\boldsymbol{p}}_{q}$. This happens when so that, from (ii) in Theorem 7, = . This latter point, $\hat{\boldsymbol{p}}_{q}$, is the intersection between $\mathcal{F}(\mathbf{A},\boldsymbol{q})$ and the non-homogeneous variety . This specific value of the adjustment factor , is the adjustment factor of the MLE under given . An illustration is given next.
Relational models for probabilities without the overall effect are curved exponential families, and the computation of the MLE under such models is not straightforward. An extension of the iterative proportional fitting procedure, G-IPF, that can be used for both models with and models without the overall effect was proposed in Klimova and Rudas (2015) and is implemented in Klimova and Rudas (2014). Alternatively, the MLEs can be computed, for instance, using the Newton-Raphson algorithm or the algorithm of Evans and Forcina (2013). One of the algorithms described in Forcina (2017) suggested a possible modification of G-IPF. A brief description of the original and modified versions of G-IPF is given below:
[TABLE]
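For illustration, a minimal sketch of an iterative-scaling scheme in the spirit of the original G-IPF is given here. It is not the authors' implementation: the function name `gipf_sketch`, the bracketing interval for the adjustment factor, and the bisection update are choices made for this sketch. It assumes a 0-1 model matrix with no zero columns and a strictly positive observed distribution; the 3-by-6 matrix and the distribution `q` used in the example are hypothetical.

```python
import numpy as np

def gipf_sketch(A, q, sweeps=5000, tol=1e-12, bisections=60):
    """Iterative scaling plus a bisection search over the adjustment factor."""
    A = np.asarray(A, dtype=float)
    t = A @ np.asarray(q, dtype=float)      # observed subset sums A q

    def fit(gamma):
        # Start at p = 1, which satisfies every multiplicative constraint,
        # and rescale the cells of one subset (row of A) at a time until
        # the subset sums A p match gamma * A q.
        p = np.ones(A.shape[1])
        for _ in range(sweeps):
            for j in range(A.shape[0]):
                p = p * (gamma * t[j] / (A[j] @ p)) ** A[j]
            if np.max(np.abs(A @ p - gamma * t)) < tol:
                break
        return p

    # For a 0-1 matrix with no zero column,
    # max_j (A p)_j <= sum(p) <= sum_j (A p)_j,
    # so these two values of gamma bracket the one for which sum(p) = 1.
    lo, hi = 1.0 / t.sum(), 1.0 / t.max()
    for _ in range(bisections):
        gamma = 0.5 * (lo + hi)
        if fit(gamma).sum() < 1.0:
            lo = gamma
        else:
            hi = gamma
    gamma = 0.5 * (lo + hi)
    return fit(gamma), gamma

# Hypothetical model matrix whose row space does not contain the row of ones.
A = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]])
q = np.array([0.2, 0.1, 0.3, 0.1, 0.2, 0.1])   # illustrative positive data
p_hat, gamma_hat = gipf_sketch(A, q)
```

In this sketch the returned vector sums to one, satisfies the multiplicative constraints implied by the matrix, and has subset sums proportional to those of `q`, with `gamma_hat` playing the role of the adjustment factor.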
Theorem 8.
If , the G-IPFm algorithm converges, and its limit is equal to , the ML estimate of under .
Proof.
The convergence of one iteration of G-IPFm, when is fixed, can be proved similarly to Theorem 3.2 in Klimova and Rudas (2015). The limit is positive, , and thus, by Lemma 1 in Forcina (2017), $f(\gamma)=\boldsymbol{d}_{1}^{\prime}\log\tilde{\boldsymbol{p}}_{\gamma}$ is a strictly increasing and differentiable function of . So, one can update , until for some the G-IPFm limit satisfies: . Because, in this case,
[TABLE]
the uniqueness of the MLE implies that $\tilde{\boldsymbol{p}}_{\gamma_{q}}=\hat{\boldsymbol{p}}_{q}$ and . ∎
The original G-IPF can be used whether or not has some zeros, and it computes a sequence whose elements are the unique intersections of the variety and each of the polyhedra defined by $\mathbf{A}\tilde{\boldsymbol{\tau}}=\gamma\mathbf{A}\boldsymbol{q}$ for different . This sequence converges, and its limit belongs to the hyperplane (Klimova and Rudas, 2016). G-IPFm produces a sequence whose elements are the unique intersections of the interior of the homogeneous variety and each of the polyhedra $\mathcal{F}(\mathbf{A},\boldsymbol{q},\gamma)$. The limit of this sequence belongs to the interior of the non-homogeneous variety . To ensure the existence, differentiability, and monotonicity of , described above, the G-IPFm algorithm should be applied only when . If has some zero components, the positive MLE $\hat{\boldsymbol{p}}_{q}$ may still exist, see Theorem 5(ii). However, for some , because, in general, the matrices and have different facial sets, no strictly positive would satisfy $\bar{\mathbf{A}}\boldsymbol{p}_{\gamma}=\begin{pmatrix}1\\ \gamma\mathbf{A}\boldsymbol{q}\end{pmatrix}$.
Some limitations and advantages of using the generalized IPF were addressed in Klimova and Rudas (2015), Section 2. In particular, while the assumption that the model matrix is of full row rank can be relaxed for G-IPF, it is one of the major assumptions for the Newton-Raphson and the Fisher scoring algorithms. The algorithms proposed in Forcina (2017) also require the model matrix to be of full row rank, and their convergence relies on the positivity of the observed distribution.
5 Loss of potentials in hematopoiesis
Hematopoietic stem cells (HSC) are able to become progenitors that, in turn, may develop into mature blood cells. Understanding the process of forming mature blood cells, called hematopoiesis, is one of the most important aims of cell biology, as it may help to develop new cancer treatments. The HSC progenitors can proliferate (produce cells of the same type) or differentiate (produce cells of different types). Multiple experiments suggested that HSC progenitors are multipotent cells and differentiate by losing one of the potentials. The mature blood cells, in contrast, are unipotent and do not proliferate or differentiate. The differentiation is believed to be a hierarchical process, with HSC progenitors and mature blood cells at the highest and the lowest levels, respectively.
The models discussed below apply to the steady-state of hematopoiesis, under the assumption that cells neither proliferate nor die and can undergo only the first phase of differentiation. Various hierarchical models for differentiation have been proposed (cf. Kawamoto, Wada, and Katsura, 2010; Ye, Huang, and Guo, 2017). The equal loss of potentials (ELP) model was introduced in Perié et al. (2014), and is described next. Denote by the three-potential HSC progenitor of the , , and mature blood cell types. During the first phase of differentiation, an progenitor can differentiate by losing either one or two potentials at the same time, and thus produce a cell of one of the six types: , , , , , .
Let be the vector of probabilities of losing the corresponding potentials from :
[TABLE]
For example, is the probability of losing the potential from , is the probability of losing the potential from , and is the probability of losing the and potentials from at the same time, and so on. The ELP model assumes that “the probability to lose two potentials at the same time is the product of the probability of losing each of the potentials” (see caption to Fig 3A, Perié et al., 2014):
[TABLE]
The model specified by (19) is the relational model generated by the matrix
[TABLE]
or, in a parametric form,
[TABLE]
where, using the notation in Perié et al. (2014), are the parameters associated with the loss of the corresponding potential from . It can be easily verified that the relational model generated by (20) does not have the overall effect, so the normalization has to be added as a separate condition:
[TABLE]
Perié et al. (2014) define the ELP model in the following parametric form:
[TABLE]
That is, the authors rescaled the loss probabilities to force them to sum to . In fact, (5) is also a relational model; it is generated by
[TABLE]
and can be obtained by adding the overall effect to the model defined by (20). Because the original model does not have the overall effect, adding a row of ’s changed this model. One can check by substitution that the probabilities in (5) do not satisfy the multiplicative constraints (19). The estimates of the probabilities of loss of potentials from the cells are shown in Figure 3B of Perié et al. (2014). In the notation used here,
[TABLE]
These probabilities sum to , but also do not satisfy (19).
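The effect of the rescaling can be checked numerically. The sketch below is an illustration under stated assumptions, not a reproduction of the tables above: the potentials are labelled 1, 2, 3 (hypothetical labels), the generating matrix is assumed to have one row per potential and one column per loss event, following the quoted product rule, and the parameter values in `theta` are made up. The helper names `has_overall_effect` and `product_gaps` are introduced only for this example.

```python
import numpy as np

# Columns: lose 1, lose 2, lose 3, lose 1&2, lose 1&3, lose 2&3
# (assumed layout; row i marks the events in which potential i is lost).
A = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)

def has_overall_effect(A):
    """True if the row of ones lies in the row space of A."""
    stacked = np.vstack([A, np.ones(A.shape[1])])
    return np.linalg.matrix_rank(stacked) == np.linalg.matrix_rank(A)

def product_gaps(p):
    """Joint-loss probability minus the product of the two single-loss ones."""
    return np.array([p[3] - p[0] * p[1],
                     p[4] - p[0] * p[2],
                     p[5] - p[1] * p[2]])

theta = np.array([0.3, 0.2, 0.4])         # illustrative parameter values
unscaled = np.exp(A.T @ np.log(theta))    # multiplicative form; sums to 1.16 here
rescaled = unscaled / unscaled.sum()      # forced to sum to one

print(has_overall_effect(A))                    # False
print(np.allclose(product_gaps(unscaled), 0))   # True
print(np.allclose(product_gaps(rescaled), 0))   # False
```

With these values the unscaled vector satisfies the product constraints but does not sum to one, while the rescaled vector sums to one but violates them. Both requirements hold together only when the parameters are chosen so that the unscaled sum is already one.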
Acknowledgments
The authors wish to thank Antonio Forcina for his thought-provoking discussions, Ingmar Glauche and Christoph Baldow for their help with understanding the main concepts of hematopoiesis, and Wicher Bergsma. The second author is also a Recurrent Visiting Professor at the Central European University and the moral support received is acknowledged.
References
- Aitchison, J. and Silvey, S. D. (1960). Maximum-likelihood estimation procedures and associated tests of significance. J. Roy. Statist. Soc. Ser. B 22, 154–171.
- Andreas, J. and Klein, D. (2015). When and why are log-linear models self-normalizing? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Cox, D. A., Little, J. and O’Shea, D. (2007). Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra. New York: Springer.
- Evans, R. J. and Forcina, A. (2013). Two algorithms for fitting constrained marginal models. Comput. Statist. Data Anal. 66, 1–7.
- Fienberg, S. E. and Rinaldo, A. (2012). Maximum likelihood estimation in log-linear models. Ann. Statist. 40, 996–1023.
- Forcina, A. (2017). Estimation for multiplicative models under multinomial sampling. arXiv:1704.06762.
- Geiger, D., Meek, C. and Sturmfels, B. (2006). On the toric algebra of graphical models. Ann. Statist. 34, 1463–1492.
- Grünbaum, B. (2003). Convex polytopes. Springer.
