Efficient EM Estimation for the Pogit Model via Polya-Gamma Augmentation

Iván Gutiérrez; Sandra Ramírez; Leonardo Jofré

PMC · DOI:10.3390/e28020207·February 11, 2026

Efficient EM Estimation for the Pogit Model via Polya-Gamma Augmentation

Iván Gutiérrez, Sandra Ramírez, Leonardo Jofré

PDF

Open Access

TL;DR

The paper introduces a new and efficient method for estimating parameters in the pogit model using a Polya-Gamma augmentation approach, enabling faster and scalable analysis of large count datasets.

Contribution

A novel EM algorithm for the pogit model using Polya-Gamma augmentation, enabling efficient and scalable estimation for large datasets.

Findings

01

The proposed EM algorithm has low per-iteration cost and supports computational enhancements like mini-batch implementations.

02

Simulation and real-data applications show substantial computational improvements without loss of statistical accuracy.

03

The method is a scalable and competitive alternative to existing maximum-likelihood optimization routines for large-scale pogit estimation.

Abstract

The Poisson-logistic (pogit) model is widely used for count data with latent intensities, with applications including under-reporting correction and share-of-wallet estimation, yet existing estimation methods do not scale well to large datasets. We propose a new expectation-maximization (EM) algorithm for the standard pogit model based on Polya-Gamma data augmentation, which yields a conditionally Gaussian complete-data likelihood with closed-form EM-updates. The resulting EM algorithm has low per-iteration cost and naturally accommodates computational enhancements, including quasi-Newton acceleration and mini-batch implementations. These features enable efficient inference on datasets with millions of observations. Simulation studies and real-data applications demonstrate substantial computational improvements without loss of statistical accuracy, and comparisons with direct…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species2

Malus domestica(apple · species)Homo sapiens(human · species)

Chemicals2

Polya EM

Diseases1

injury to

Figures9

Click any figure to enlarge with its caption.

Funding2

—Pontificia Universidad Javeriana, Cali, Colombia
—CONICYT PFCHA/Doctorado Becas Chile

Keywords

EM algorithmpogitunder-reportingshare-of-wallet

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStatistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models · Statistical Methods and Inference

Full text

1. Introduction

Count data models (see, e.g., [1]) with latent exposure or reporting mechanisms are common in many empirical settings, including marketing analytics [2], epidemiology [3], official statistics [4], and gender-violence research [5,6]. Among these, the Poisson-logistic (pogit) model [7] has emerged as a flexible and interpretable framework for modeling observed counts subject to under-reporting or partial observability. By combining a binomial observation equation with a Poisson exposure model, the pogit specification allows researchers to disentangle reporting behavior from the underlying intensity process, while retaining a regression structure that facilitates inference and interpretation.

The pogit model has found applications in diverse areas. In studies of under-reporting, it provides a principled way to account for missing or censored events by explicitly modeling the probability that an event is observed (see, e.g., [7]). In marketing and consumer analytics, it has been used for share-of-wallet (SoW) estimation, where observed purchases represent a noisy subset of latent purchase opportunities (see, e.g., [2]). These applications have motivated a growing methodological literature, including extensions to negative-binomial exposures [4], as well as parametric and nonparametric Bayesian formulations that incorporate prior information and/or variable selection (see, e.g., [5,8,9]).

Despite these advances, a major practical limitation of existing pogit methodologies is their lack of scalability. While the latent count can be analytically marginalized and maximum-likelihood estimation can in principle be based on the observed data likelihood, the resulting objective function is highly nonlinear and tightly couples the reporting and intensity components through a multiplicative structure. This makes direct likelihood maximization numerically delicate in practice, particularly as sample size and covariate dimension grow, or when parameters are weakly identified. Bayesian approaches face related difficulties, as Markov chain Monte Carlo methods must explore high–dimensional posteriors with strong dependence across model components. As a consequence, pogit models remain difficult to deploy in large–scale applications, despite their conceptual suitability for precisely such settings.

In this article, we address this literature gap by introducing a new expectation-maximization (EM) algorithm [10] for efficient estimation of the standard pogit model. Our approach builds on Polya-Gamma data augmentation [11], which yields conditionally Gaussian complete-data likelihoods and enables closed-form updates in the M-step. Specifically, we adapt the Polya-Gamma-based EM framework developed by Scott et al. [12] for logistic regression to the pogit setting, combining it with the approximate augmentation strategy for Poisson models introduced by D’Angelo et al. [13]. This synthesis results in an EM algorithm with a simple structure and closed–form updates throughout. Crucially, the resulting procedure is naturally amenable to online and mini-batch variants, making it possible to fit pogit models to datasets of unprecedented size using streaming data, without sacrificing statistical efficiency.

The remainder of the article is organized as follows. Section 2 introduces the pogit model and reviews its likelihood structure. Section 3 presents the proposed Polya-Gamma-based EM algorithm and discusses its computational properties, including quasi-Newton acceleration and mini-batch extensions for large-scale data. Section 4 describes additional computational enhancements. Section 5 reports results from simulation studies assessing convergence, finite-sample performance, and computational efficiency, with comparisons to direct numerical maximum-likelihood estimation. Section 6 applies the method to real datasets. Section 7 concludes.

Contributions

This article makes the following contributions:

We introduce a new expectation–maximization algorithm for the standard pogit model based on Polya-Gamma data augmentation. By combining an exact augmentation for the binomial component with a controlled approximation for the Poisson component, the complete–data log–likelihood becomes quadratic in the regression parameters. This yields closed-form expressions for all E-step expectations and reduces the M-step to simple weighted least-squares updates, resulting in a fully analytic EM procedure with low per-iteration computational cost.
We show that the resulting EM algorithm admits scalable online and mini-batch variants. In particular, the method can be applied to datasets with millions of observations using mini-batch updates, making pogit models feasible in large-scale applications where existing methods break down.
We evaluate the statistical and computational performance of the proposed estimator using both simulated and real datasets. The results demonstrate fast convergence, stable behavior across sample sizes, and competitive estimation accuracy.
We provide a systematic comparison with direct numerical maximization of the observed-data likelihood using generic maximum-likelihood routines, showing that the proposed method delivers substantial runtime improvements while exhibiting stable finite-sample behavior, robust parameter recovery, and numerical stability.

2. Model

2.1. Hierarchical Specification

We consider the standard Poisson-logistic (pogit) model for count data with latent intensity and partial observability. For each observational unit $[eqn]$ , let $[eqn]$ denote the observed count and $[eqn]$ an unobserved latent count representing the total number of underlying events or opportunities. The model is defined hierarchically as

[eqn]

where $[eqn]$ denotes conditional independence across observational units, conditional on the model parameters and covariates, $[eqn]$ is a known offset, $[eqn]$ is the probability that a latent event is observed, and $[eqn]$ is the latent intensity. The first-level binomial equation captures partial observability: conditional on $[eqn]$ , only a fraction $[eqn]$ of events is recorded. The second-level Poisson equation models heterogeneity in the total number of latent events across observational units.

A key property of the pogit model is that the latent count admits a simple predictive distribution, as stated in the following property:

Property 1.

For each observational unit $[eqn]$ in a pogit model, $[eqn]$ .

This result follows from the Poisson thinning property (see, e.g., [14]) and plays an important role in the EM algorithm developed below.

2.2. Regression Structure

Both the reporting probability and the latent intensity are linked to covariates through separate regression specifications:

[eqn]

where $[eqn]$ denotes the linear predictor associated with component j for observational unit i, $[eqn]$ is a column vector of observed covariates (with $[eqn]$ denoting its transpose), $[eqn]$ is the corresponding vector of regression coefficients, $[eqn]$ denotes the number of covariates entering component j, and $[eqn]$ is the sigmoid (or expit) function. This specification allows the reporting mechanism and the latent intensity to depend on distinct, possibly overlapping, covariate sets. Such flexibility is essential in applications such as under-reporting correction and share-of-wallet estimation. Moreover, the chosen link functions yield interpretable parameters and facilitate likelihood augmentation.

Throughout the paper, the model is defined in terms of the natural parameters $[eqn]$ , while many derivations and algorithmic steps are expressed in terms of the associated linear predictors $[eqn]$ , where $[eqn]$ and $[eqn]$ . This reparameterization is purely deterministic and one-to-one. For notational simplicity and computational convenience, we work with whichever representation is more appropriate in a given context, and the mapping between the two is made explicit whenever it is used.

Figure 1 provides a graphical representation of the model using plate notation [15]. Regression coefficients $[eqn]$ are shared across observational units, while $[eqn]$ and $[eqn]$ are unit-specific latent and observed variables.

2.3. Observed Likelihood and Identification

Marginalizing over the latent count $[eqn]$ yields

[eqn]

see, e.g., Kingman [14]. The observed-data likelihood, therefore, depends on the parameters only through the product $[eqn]$ . As a consequence, without additional structure, the reporting probability and the latent intensity are not separately identified. In particular, for any $[eqn]$ , the transformation $[eqn]$ and $[eqn]$ leaves the likelihood unchanged.

Identification is restored by the regression structure, which links $[eqn]$ and $[eqn]$ to covariates through distinct linear predictors. Under mild regularity conditions, this structure ensures local identification of the parameter vector $[eqn]$ , that is, that there is a neighborhood of the true $[eqn]$ without observationally equivalent parameter values (in the sense of producing the exact same likelihood). The identification result is not merely technical. Local identification is a necessary condition for the observed-data likelihood to admit a locally unique maximizer in a neighborhood of the true parameter value, thereby ruling out flat ridges generated by observationally equivalent parameterizations. This property underpins meaningful likelihood-based inference by ensuring that distinct parameter values correspond to distinct data-generating processes in a local neighborhood. While local identification does not preclude the presence of nearly flat regions of the likelihood, it guarantees local uniqueness of the true parameter, which is a necessary condition for well-defined EM-based estimation. A formal definition of local identification and the associated regularity concepts are provided in Appendix A.

Theorem 1.

Let $[eqn]$ , and let $[eqn]$ and $[eqn]$ . Suppose $[eqn]$ is an i.i.d. sequence such that $[eqn]$ is positive definite. Then the pogit parameters are locally identified.

Proof. See Appendix A. □

Remark 1 (Overlap and weak identification). Local identification may hold even when $[eqn]$ and $[eqn]$ overlap. However, strong overlap or collinearity renders the Fisher information matrix nearly singular, leading to weak identification in finite samples. Applied work, therefore, often relies on distinct covariates across hierarchical levels (see, e.g., [2]), or incorporates auxiliary data in which the counts $[eqn]$ are observed (see, e.g., [8]).

Remark 2 (Local versus global identification). Theorem 1 establishes local, but not global, identification. As shown by Brennan et al. [16], when covariates entering the intensity equation are a subset of those entering the reporting equation, the model may be locally identified while remaining globally unidentified. Common remedies include reducing covariate overlap, incorporating auxiliary observations on latent counts, or imposing sign restrictions informed by expert judgment.

From a computational perspective, the observed-data likelihood is highly nonlinear in $[eqn]$ . Direct maximization requires numerical optimization and becomes increasingly costly as the sample size and covariate dimension grow. Moreover, the multiplicative interaction between $[eqn]$ and $[eqn]$ often leads to instability in large samples. These features motivate the search for an augmented representation with simpler structure.

2.4. A Naive EM Algorithm

The hierarchical formulation naturally suggests an EM algorithm based on the augmented likelihood $[eqn]$ , that is

[eqn]

Given a current iterate $[eqn]$ , such an algorithm replaces $[eqn]$ by $[eqn]$ in the E-step and updates $[eqn]$ via binomial and Poisson regressions in the M-step. While formally valid, this approach is computationally unattractive: both regressions require iterative solvers, so each EM iteration contains nested optimization loops. As a result, the procedure scales poorly and offers little computational advantage over direct likelihood maximization.

The key insight of this article is that augmenting only with the latent counts $[eqn]$ is insufficient to obtain a scalable EM algorithm, because the resulting complete likelihood remains non-quadratic in the regression parameters.

To overcome this limitation, we exploit the fact that both components of the pogit model admit conditionally Gaussian representations under suitable augmentations. The binomial component admits an exact Polya-Gamma augmentation, while the Poisson component can be accurately approximated by a negative-binomial pmf that also yields a Polya-Gamma augmentation.

Introducing Polya-Gamma variables in addition to the latent counts renders the complete log-likelihood quadratic in $[eqn]$ . The resulting EM algorithm features closed-form E-step expectations and M-steps that reduce to weighted least-squares problems with explicit solutions. This structure yields low per-iteration cost and naturally accommodates large-scale extensions. The construction of this augmented likelihood and the corresponding EM updates are developed in the next section.

3. An Improved EM Algorithm

In this section, we present a scalable expectation–maximization (EM) algorithm for the pogit model that exploits Polya-Gamma augmentation to obtain a quadratic complete–data log-likelihood and closed-form updates in both the E- and M-steps. As mentioned in Section 2, the algorithm is based on the Polya-Gamma distribution, so we start by explaining this distribution and its key properties.

3.1. The Polya-Gamma Distribution

The Polya-Gamma distribution was introduced by Polson et al. [11] as part of a new data augmentation for logistic regression models. A random variable $[eqn]$ is said to follow a Polya-Gamma distribution with parameters $[eqn]$ , denoted $[eqn]$ , where $[eqn]$ and $[eqn]$ , if it admits the representation

[eqn]

where $[eqn]$ are independent $[eqn]$ random variables.

The Polya-Gamma distribution has two key properties.

Property 2.

For any $[eqn]$ and $[eqn]$ ,

[eqn]

where $[eqn]$ and $[eqn]$ .

This property will be useful for our EM algorithm, as it transforms binomial and negative-binomial likelihoods into Gaussian kernels conditional on the latent variable w.

Property 3.

Let $[eqn]$ . Then, $[eqn]$ for any $[eqn]$ and $[eqn]$ otherwise, where $[eqn]$ .

This property will be particularly convenient for our EM algorithm, as it will allow the E-step to be computed analytically without numerical integration.

3.2. The Augmented Likelihood

We now derive our likelihood augmentation. The central idea is to augment the pogit model so that the complete-data log-likelihood becomes quadratic in the regression parameters $[eqn]$ , yielding closed-form updates in the M-step and eliminating inner optimization loops. For ease of exposition, we present the main steps of the likelihood augmentation here; the full derivation is deferred to Appendix B.

Consider the ith contribution to the naive complete likelihood,

[eqn]

We treat the two components separately.

Before deriving the augmented likelihood contributions, recall that $[eqn]$ and $[eqn]$ denote the linear predictors associated with the binomial (reporting) and latent intensity components, respectively.

3.2.1. Binomial Component

Using the Polya-Gamma identity in Property 2 and rearranging terms, the binomial likelihood can be written as

[eqn]

where $[eqn]$ denotes expectation with respect to $[eqn]$ , $[eqn]$ , and the symbol ∝ denotes equality up to a multiplicative constant independent of the parameters of interest. Therefore, conditional on $[eqn]$ , the binomial likelihood contribution can be expressed as an exponential quadratic form in the linear predictor $[eqn]$ , that is, as a Gaussian kernel (up to a normalizing constant).

3.2.2. Poisson Component

Unlike the binomial likelihood, the Poisson likelihood does not admit an exact Polya-Gamma representation. However, it is well-known that the $[eqn]$ distribution converges in distribution to the $[eqn]$ distribution as $[eqn]$ (see, e.g., [14]). Unlike the binomial likelihood, the Poisson likelihood does not admit an exact Polya-Gamma representation. We, therefore, approximate it using a negative-binomial distribution with parameter $[eqn]$ , where $[eqn]$ controls the accuracy of the approximation.

Hence, we can approximate

[eqn]

for some $[eqn]$ . Using Property 2 and simplifying, we obtain

[eqn]

where $[eqn]$ denotes expectation with respect to $[eqn]$ , $[eqn]$ . Therefore, conditional on $[eqn]$ , the approximated Poisson likelihood can be expressed as an exponential quadratic form in the linear predictor $[eqn]$ , that is, as a Gaussian kernel (up to a normalizing constant). This representation is key to obtaining closed-form updates in the M-step.

Remark 3.

The use of a negative-binomial approximation to enable Polya-Gamma augmentation for Poisson models was first proposed by D’Angelo et al. [13] in the context of Bayesian Poisson regression, and shown to improve computational efficiency relative to earlier approaches.

Remark 4.

Although the proposed augmentation is based on an approximation rather than an exact identity, its quality is fully controllable: accuracy can be made arbitrarily high by increasing $[eqn]$ , with the only practical limitation being numerical stability. In practice, even moderate values of $[eqn]$ already yield excellent approximations; in our experiments, $[eqn]$ produced indistinguishable parameter estimates. For fixed $[eqn]$ , the proposed procedure is an exact EM algorithm for a well-defined approximating likelihood based on a negative-binomial representation of the Poisson component, and this likelihood converges pointwise to the pogit likelihood as $[eqn]$ .

3.2.3. Augmented Log-Likelihood

Introducing the aforementioned $[eqn]$ ’s to the model, we obtain the following augmented log-likelihood:

[eqn]

where ≐ denotes equality up to additive constants that do not depend on the parameters of interest (in this case, $[eqn]$ ).

3.3. EM Steps

Let $[eqn]$ denote the current iterate of $[eqn]$ . As with any EM algorithm, our procedure updates this value following two general steps:

E-step: compute the Q-function, $[eqn]$ .
M-step: update $[eqn]$ by maximizing the Q-function.

As we shall see, both steps admit closed-form expressions, leading to a simple and efficient EM algorithm.

3.3.1. E-Step

First, note that $[eqn]$ depends on $[eqn]$ only through $[eqn]$ . In particular,

[eqn]

where $[eqn]$ denotes the conditional expectation of any random variable a (for instance, $[eqn]$ ). We now explain how to compute $[eqn]$ and $[eqn]$ .

Computing $[eqn]$ . This is straightforward as it is linear in $[eqn]$ :

[eqn]

Computing $[eqn]$ . This is also straightforward. Property 1 implies that $[eqn]$ is equal to $[eqn]$ , so evaluating at $[eqn]$ gives

[eqn]

Computing $[eqn]$ . This is more challenging, but it is well-known that

$[eqn]$ [11]. $[eqn]$ [13].

Hence, using Property 3 and the law of iterated expectations, we obtain

[eqn]

In summary, $[eqn]$ has a closed-form expression.

Remark 5.

In order to apply the E-step successfully, the functions $[eqn]$ and $[eqn]$ must be evaluated in a numerically stable way. In particular, we use

[eqn]

and

[eqn]

In our experiments, $[eqn]$ worked fine.

Remark 6.

Exploiting the closed-form mean of the Polya-Gamma distribution within an expectation-maximization framework was first proposed by Scott et al. [12] in the context of logistic regression, where it was shown to yield substantial computational gains relative to naive optimization strategies.

3.3.2. M-Step

As $[eqn]$ is quadratic in $[eqn]$ , maximization yields closed-form solutions. For $[eqn]$ , let $[eqn]$ denote the design matrix collecting the covariates $[eqn]$ associated with component j across all observational units. By standard least-squares theory, the solution is given by

[eqn]

where $[eqn]$ is a diagonal matrix with entries $[eqn]$ .

In this way, each EM iteration reduces to two independent weighted least-squares problems that require no any inner iterative procedure. In addition, each update has a structure particularly well suited for parallel and mini-batch extensions.

Remark 7.

The Polya-Gamma augmentation used here is closely related to constructions commonly exploited in variational Bayes (VB) methods (see, e.g., [17]) for logistic models [18]. The present approach, however, differs in both objective and implementation. Our algorithm is an expectation-maximization procedure that directly targets the observed-data likelihood of the pogit model, rather than a variational lower bound, and imposes no factorization assumptions on latent variables. All conditional expectations in the E-step are computed exactly under the augmented model, and stochastic approximation is used solely to scale the computation of sufficient statistics in large samples. As a result, the method retains the likelihood-based interpretation and asymptotic properties of maximum-likelihood estimation, while achieving computational efficiency comparable to VB approaches.

4. Computational Enhancements

The EM algorithm introduced in Section 3 has two key computational advantages: each iteration has low cost due to closed-form updates, and the algorithm admits a simple representation in terms of sufficient statistics. Nevertheless, two practical challenges remain in large-scale applications. First, when the sample size N is very large, even inexpensive full-batch iterations can become costly. Second, like most fixed-point algorithms, EM may converge slowly when the likelihood surface is flat or parameters are weakly identified.

In this section, we address these challenges using two complementary strategies. To scale the algorithm to massive datasets, we develop a mini-batch EM scheme based on Robbins–Monro stochastic approximation. To accelerate convergence (even in samples of moderate size), we combine the EM updates with a quasi-Newton extrapolation technique known as SQUAREM. These enhancements target distinct computational bottlenecks and can be used independently or in combination.

4.1. Mini-Batch EM via Robbins–Monro

To reduce computational cost when N is large, we adopt an online EM approach based on stochastic approximation [19], following the general framework of Cappé and Moulines [20]. The key observation is that the EM algorithm derived in Section 3 depends on the data only through additive sufficient statistics, making it particularly well suited for mini-batching.

Recall that the M-step for component $[eqn]$ depends on the statistics

[eqn]

Given a random mini-batch $[eqn]$ , unbiased estimators are

[eqn]

where randomness arises solely from subsampling.

A naive approach would replace $[eqn]$ directly with their mini-batch counterparts. However, although unbiased, these estimators exhibit persistent sampling noise that prevents convergence of the resulting EM iterations. To stabilize the procedure, Cappé and Moulines [20] propose updating the sufficient statistics using diminishing step sizes, that is

[eqn]

where $[eqn]$ , $[eqn]$ , and the step size sequence $[eqn]$ shrinks in such a way that $[eqn]$ and $[eqn]$ .

In practice, we use

[eqn]

with $[eqn]$ , $[eqn]$ , and $[eqn]$ . The initial constant phase stabilizes early iterations, while the polynomial decay ensures asymptotic convergence.

To further reduce variance and improve finite-sample performance, we apply Polyak averaging [21,22] to the parameter iterates after burn-in. Specifically, the reported estimator is

[eqn]

which achieves optimal asymptotic variance for stochastic approximation schemes [20].

This mini-batch EM procedure preserves the structure of the exact Polya-Gamma EM updates while reducing per-iteration complexity to $[eqn]$ , independent of the total sample size. As a result, the method scales naturally to datasets with millions of observations.

Remark 8.

Mini–batch EM algorithms based on stochastic approximation may require a large number of iterations and can exhibit substantial variability in the raw parameter iterates. For this reason, we recommend using a sufficiently large iteration budget (at least $[eqn]$ iterations in our experiments) and assessing convergence in terms of the Polyak–averaged estimator $[eqn]$ rather than the instantaneous iterate $[eqn]$ . This practice stabilizes inference and aligns with standard recommendations in stochastic approximation.

Remark 9.

Unlike full-batch EM, mini-batch EM does not guarantee a monotone increase in the observed-data likelihood. While monotonicity can in principle be enforced via safeguarding steps, our simulation experiments indicate that the proposed algorithm exhibits stable behavior without such modifications. For this reason, and to preserve computational simplicity, we adopt the unsafeguarded version in our implementation.

4.2. Quasi-Newton Acceleration via SQUAREM

Even in full-batch settings, EM algorithms may converge slowly when the likelihood surface is flat or parameters are weakly identified. To accelerate convergence, we complement our method with a quasi-Newton extrapolation technique known as SQUAREM [23].

Let $[eqn]$ denote the EM update mapping induced by one full E- and M-step, so that the standard EM iteration is $[eqn]$ . SQUAREM treats EM as a fixed-point iteration and constructs an accelerated update by extrapolating along the EM trajectory. Starting from $[eqn]$ , define

[eqn]

Here, r represents the first-order displacement induced by EM, while v captures curvature by measuring deviations from linearity along the EM path.

The accelerated update is then given by

[eqn]

where the step size $[eqn]$ is chosen to approximately minimize the norm of the residual. Following Varadhan and Roland [23], we set

[eqn]

A safeguarding step ensures monotonicity of the observed log-likelihood, reverting to the standard EM update whenever the extrapolated iterate decreases the likelihood.

SQUAREM is particularly well-suited to our Polya-Gamma-based EM algorithm. First, each EM iteration is deterministic and inexpensive, making the cost of additional EM evaluations negligible relative to the gains in convergence speed. Second, the M-step consists of weighted least-squares updates, which vary smoothly with the parameters and favor quasi-Newton extrapolation. Third, the method is entirely generic and requires no modification of the underlying EM structure.

In our empirical experiments, SQUAREM substantially reduces the number of EM iterations required to reach convergence, often by an order of magnitude, while preserving numerical stability. For this reason, we recommend SQUAREM as the default acceleration strategy when fitting the pogit model in moderate to large samples.

5. Simulation Study

This section evaluates the proposed EM algorithm along three complementary dimensions. First, we study its finite-sample estimation behavior under moderate sample sizes, focusing on signal recovery and dispersion across Monte Carlo replications. Second, we assess numerical robustness with respect to the negative-binomial approximation parameter underlying the Polya-Gamma augmentation. Third, we examine computational scalability in large samples and quantify the gains achieved by standard acceleration techniques. Together, these experiments are designed to validate the statistical accuracy, numerical stability, and practical scalability of the proposed estimation framework.

Remark 10.

All simulations were conducted on a desktop computer equipped with an Intel® Core™ i7-9750H CPU (6 cores, 2.60GHz) and 16 GB of RAM, running Windows 11 (64-bit).

5.1. Finite-Sample Estimation Behavior

We begin by examining the finite-sample behavior of the EM estimator in moderate sample sizes. In all scenarios, each model component includes nine covariates (i.e., $[eqn]$ ). For each observational unit i, the covariate vectors are generated independently according to $[eqn]$ . The scaling factor $[eqn]$ is chosen so that, if all regression coefficients are set equal to one, the resulting linear predictors $[eqn]$ have variance of moderate magnitude. This normalization keeps the linear predictors within a numerically stable range for the logistic and exponential link functions throughout the simulations.

The true parameter vectors are fixed at

[eqn]

so that only the first two coefficients in each vector are nonzero. This design induces a sparse signal structure and allows us to assess signal recovery.

Throughout the simulation study, the exposure term is fixed at $[eqn]$ for all i. We consider two sample sizes: $[eqn]$ (Scenario 1) and $[eqn]$ (Scenario 2). For each scenario, $[eqn]$ independent datasets are generated from the data-generating process described above. In each replication, the EM algorithm is initialized randomly and iterated until convergence, with a maximum of 1000 iterations allowed. Convergence is declared when the maximum relative change across all parameter components between successive iterations falls below $[eqn]$ .

Figure 2 summarizes the resulting Monte Carlo distributions of the EM estimates. The nonzero coefficients in $[eqn]$ and $[eqn]$ are accurately recovered in both scenarios, while coefficients that are truly zero remain tightly concentrated around zero. Increasing the sample size from 500 to $[eqn]$ leads to a visible reduction in dispersion across replications, indicating improved estimator concentration in larger samples.

5.2. Robustness to the Approximation Parameter

Having established stable finite-sample behavior, we next examine robustness with respect to the tuning parameter r controlling the accuracy of the negative-binomial approximation used to obtain a Polya-Gamma augmentation of the Poisson component. While larger values of r yield a closer approximation to the Poisson likelihood, excessively large choices may introduce unnecessary numerical overhead and potential numerical issues. Evaluating sensitivity to this parameter is, therefore, important for practical implementation.

Figure 3 reports Monte Carlo boxplots for selected coefficients (indices 1 and 2) of $[eqn]$ and $[eqn]$ across values $[eqn]$ , with $[eqn]$ for all i. Across all panels, the empirical distributions remain stable as r varies. While Monte Carlo variability is present, no systematic shifts or trends are observed as r increases from relatively small to moderately large values.

These results indicate that the proposed EM algorithm is not unduly sensitive to the precise choice of the tuning parameter, provided that r is chosen sufficiently large to yield an accurate quadratic approximation of the Poisson component. In particular, moderate values such as $[eqn]$ , which we adopt as a default in subsequent experiments, already yield estimation behavior comparable to that observed for larger values.

Remark 11.

To complement the results of this particular experiment, we included three additional figures in Appendix C, showing the sensitivities of the estimates’ standard errors, the observed likelihood and the effective number of iterations to r. Together, these figures show that the effective number of iterations grows with r, but standard errors and the log-likelihood tend to stabilize at $[eqn]$ . These supplementary results strengthen our practical recommendation of $[eqn]$ .

5.3. Computational Scalability and Acceleration

Having established finite-sample accuracy and numerical robustness, we now turn to computational performance. These experiments focus on scalability in large samples and are designed to assess whether the proposed EM algorithm and its accelerated variants remain practical in regimes where existing pogit implementations become computationally prohibitive.

We consider three complementary comparisons. First, we benchmark the proposed deterministic EM algorithm against direct maximum-likelihood (ML) optimization of the observed-data likelihood. Second, we evaluate the impact of mini-batch estimation via Robbins–Monro updates on runtime scalability. Third, we assess the gains from quasi-Newton acceleration using SQUAREM. In all cases, methods target the same likelihood and are run under identical initialization/stopping rules. For the Robbins–Monro mini-batch EM variant, stepsize sequences were chosen according to standard diminishing-stepsize conditions; all tuning parameters are reported in Supplementary File S1.

Figure 4 and Figure 5 report total runtime (measured in seconds) as a function of sample size (measured in hundreds of thousands of observations). Figure 4 compares the proposed EM algorithm with direct ML optimization (implemented in Julia using BFGS via the R function stats::optim(); Figure 5 compares standard EM with its SQUAREM-accelerated version and Figure 6 compares full-batch EM with a Robbins–Monro mini-batch variant.

Across all methods, runtime grows approximately linearly with sample size over the ranges considered. The proposed EM algorithm consistently outperforms direct ML optimization, with a consistent runtime advantage across all sample sizes considered. The Robbins–Monro variant further reduces runtime in very large samples by lowering per-iteration cost, while SQUAREM substantially decreases the number of iterations required for convergence. These gains are achieved without compromising numerical stability or likelihood monotonicity.

Overall, the simulation results demonstrate that the proposed Polya-Gamma-based EM algorithm combines stable finite-sample behavior, robustness to approximation choices, and scalability to large datasets. These properties address key practical limitations of existing pogit implementations and support the use of the method in modern large-scale applications.

Finally, to facilitate transparency and reproducibility of the computational evidence reported above, we provide the Julia version 1.12.3 and R version 4.4.2 code used to generate all simulation results in Supplementary File S1.

6. Real-World Application

We apply the proposed pogit model, estimated via a Pólya-Gamma-based EM framework, to an openly available dataset derived from Amazon purchase histories [24], with a specific focus on purchases of Apple products within the technology and electronics category. The purpose of this empirical application is to illustrate the use of the model and the practical implementation of the proposed estimation procedure in a realistic, large-scale empirical setting.

This application is not intended to demonstrate superior empirical fit or to provide a comprehensive benchmarking exercise. Rather, it is designed to illustrate the feasibility, interpretability, and numerical stability of the proposed EM framework when applied to a challenging publicly available dataset characterized by severe partial observability and heterogeneous consumer behavior.

The dataset combines detailed longitudinal Amazon purchase histories with respondent-level demographic and household information for $[eqn]$ users in the United States. Data were collected through a consent-centric crowdsourcing protocol, under which participants voluntarily provided exports of their personal Amazon purchase histories spanning 1 January 2018 to 19 March 2023, together with survey-based sociodemographic information. This design yields a rich observational dataset linking transactional behavior over time to individual and household characteristics, while ensuring informed consent and preserving user privacy.

A comprehensive descriptive and exploratory analysis of this dataset has been previously reported in Berke et al. [25]. The dataset and its construction are described in detail in Berke et al. [24] and have since been used in several independent studies examining distinct dimensions of consumer behavior (e.g., [26,27,28]). Building on this established empirical foundation, the present study does not revisit descriptive statistics or exploratory analyses previously reported in the literature, and instead proceeds directly to the model-based analysis under partial observability.

In the present study, the dataset is used under a deliberately constructed information scenario that mirrors the decision environment faced by a focal firm. Although the underlying data contain transactions for multiple brands within the technology and electronics category, the estimation strategy intentionally restricts the information available for model fitting to Apple-specific transactional histories observed on Amazon, together with respondent-level sociodemographic and household characteristics. This restriction is methodological rather than data-driven and reflects the realistic constraint that firms typically lack access to competitors’ sales and to consumers’ total category expenditure. Within this setting of partial observability, the pogit model illustrates how latent wallet allocation and underlying demand intensity can be inferred using focal-firm transaction data alone.

The model specification captures two distinct latent behavioral mechanisms. An intensity component governs the customer’s overall demand for technology and electronics, defined on a normalized latent scale and corresponding to the Size-of-Wallet (SioW). An allocation component governs the fraction of this latent category demand allocated to the focal firm, corresponding to the share-of-wallet (SoW). Identification under partial observability is achieved through the joint hierarchical structure of the model and a deliberate separation of covariates across these latent components. Apple-specific transactional and behavioral features enter the allocation mechanism, whereas broader sociodemographic and household characteristics enter the intensity mechanism.

In this empirical application, all customers are observed over the same calendar window and are, therefore, assumed to face identical exposure. Accordingly, the offset term is fixed at a constant value, $[eqn]$ for all i, and cross-sectional heterogeneity in overall purchasing activity is captured entirely through the latent intensity $[eqn]$ .

To operationalize the model in a count-data setting, we transform Apple expenditure observed on Amazon during the evaluation window $[eqn]$ (1 November 2021 to 19 March 2023) using a variance-to-mean scaling motivated by the moment structure of the Poisson distribution; a model-based and data-driven justification of this transformation is provided in Appendix D. Specifically, total monetary spending is rescaled by a constant c, estimated exclusively from the training sample, and subsequently rounded to obtain a count-like response variable compatible with the pogit specification. This transformation induces a normalized discrete scale shared by the observed response $[eqn]$ and the latent exposure $[eqn]$ , thereby allowing monetary purchase volumes to be modeled coherently within a Poisson–binomial structure. Importantly, neither $[eqn]$ nor $[eqn]$ should be interpreted as literal counts of physical transactions, orders, or items; both quantities represent discretized proxies for relative spending intensity defined on a common latent scale.

Predictors are computed exclusively from transactions in $[eqn]$ (1 January 2018 to 31 October 2021), whereas the response is computed exclusively from transactions in $[eqn]$ (1 November 2021 to 19 March 2023); the train/test partition is defined at the respondent level and applied consistently across both windows. Categorical predictors encoded as single-selection factors are included using an omitted reference level, which defines the reference group, whereas for multiple-selection categorical variables, all indicator categories are retained. All quantitative covariates are standardized prior to estimation. For quantitative transactional predictors, correlation-based screening is performed using Spearman rank correlations to accommodate potential nonlinearity and heavy-tailed distributions. All data-adaptive preprocessing steps—including winsorization thresholds, standardization parameters, and correlation-screening rules—are learned exclusively on the training sample ( $[eqn]$ ) and subsequently applied unchanged to the held-out test sample ( $[eqn]$ ), thereby ensuring strict prevention of information leakage.

Although the full Amazon dataset permits observation of technology and electronics spending across multiple brands within the Amazon channel during the evaluation window $[eqn]$ , information on non-Apple spending is used exclusively ex post as a channel-restricted benchmark for validation and contextualization of the model-implied SoW.

To ensure transparency and reproducibility of the empirical application, we provide a complete technical report in Supplementary File S2, including data pre-processing procedures, design matrices, coefficient estimates, and detailed catalogs of predictor and response variables. Reproducible implementations of the real-data application in Julia and R are provided in Supplementary File S3.

Against this methodological background, Table 1 and Table 2 summarize the main empirical patterns recovered by the pogit model under the full specification and after covariate selection, respectively. The full specification incorporates the complete set of covariates considered in the empirical analysis and reveals that most of the model’s explanatory content is concentrated in a relatively small subset of predictors related to Apple-specific transactional behavior and household characteristics—most notably RFM variables, income categories, and platform usage intensity—while the remaining covariates contribute limited additional explanatory power, motivating a more parsimonious specification to facilitate interpretation.

In the full specification (Table 1), variation in SoW is primarily driven by Apple-specific transactional behavior. Both Frequency and Monetary enter with positive and statistically significant coefficients (estimates $[eqn]$ and $[eqn]$ , respectively), indicating that customers who purchase Apple products more frequently and spend more on the focal brand allocate a larger share of their latent category demand to Apple, whereas Recency does not reach statistical significance once the full set of transactional and demographic controls is included. Heterogeneity in SioW is mainly associated with household income and platform usage intensity: the Income 100–149k category exhibits a positive and statistically significant association with latent purchase intensity (estimate $[eqn]$ ), while the highest income group does not reach conventional significance levels. Platform engagement plays a central role, with accounts shared by three individuals (Use = 3) displaying substantially higher latent category demand (estimate $[eqn]$ ). In addition, the life-event indicator Life: Became pregnant enters with a negative and statistically significant coefficient, suggesting a temporary reduction in latent purchasing intensity, whereas other demographic and life-event controls do not exhibit statistically significant effects.

In the parsimonious specification (Table 2), SoW is sharply characterized by Apple-specific behavioral variables in the allocation component. Frequency and Monetary remain the dominant determinants, with positive and statistically significant coefficients (estimates $[eqn]$ and $[eqn]$ , respectively), confirming the central role of repeated purchasing and cumulative spending in shaping wallet allocation toward the focal firm. In contrast to the full model, Recency now enters with a negative and statistically significant coefficient (estimate $[eqn]$ ), indicating that more recent Apple purchases are associated with a higher share of wallet once irrelevant covariates are removed. Under the same parsimonious specification, heterogeneity in SioW is primarily driven by household income and platform usage intensity: the Income 50–74 k and Income 100–149 k categories are positively and significantly associated with latent category demand (estimates $[eqn]$ and $[eqn]$ , respectively), while platform engagement remains a dominant factor, with accounts shared by three individuals (Use = 3) exhibiting a strong and highly significant positive association with intensity (estimate $[eqn]$ ). Consistent with the full specification, the life-event indicator Life: Became pregnant retains a negative and statistically significant effect, whereas other retained income and usage categories do not reach conventional significance levels.

7. Conclusions

This paper develops a scalable expectation-maximization algorithm for the Poisson-logistic (pogit) model, a classical framework for count data subject to partial observability. By exploiting a Polya-Gamma data augmentation, the proposed approach yields a quadratic complete-data log-likelihood and closed-form updates in both the E- and M-steps. As a result, each EM iteration is computationally inexpensive, numerically stable, and well-suited to large datasets.

The primary contribution is methodological. Unlike existing frequentist and Bayesian approaches, which typically rely on generic numerical optimization or Markov chain Monte Carlo methods, the proposed EM formulation scales naturally with sample size and covariate dimension. The algorithm admits deterministic full-batch updates, mini-batch variants based on Robbins–Monro stochastic approximation, and quasi-Newton acceleration via SQUAREM, all targeting the same observed-data likelihood. Simulation experiments confirm stable finite-sample behavior, robustness to the negative-binomial approximation underlying the Polya-Gamma construction, and substantial computational gains relative to direct maximum-likelihood optimization.

The proposed framework also suggests several directions for future research. First, the quadratic structure of the M-step makes the algorithm particularly amenable to regularization in high-dimensional settings. Penalized extensions based on $[eqn]$ penalties or global–local shrinkage priors, such as the horseshoe, could be incorporated either through penalized M-steps or additional latent-variable augmentations. Second, the Polya-Gamma construction naturally links the present EM approach to variational Bayes methods. In a related direction, exploring variational Bayes approximations built on similar augmentation schemes may yield fast approximate Bayesian procedures that complement the EM algorithm developed here.

More broadly, the results highlight the role of Polya-Gamma augmentation as a unifying computational device for efficient inference in models combining nonlinear link functions and discrete outcomes. By making frequentist estimation of pogit models feasible at scale, the proposed EM algorithm expands the practical applicability of these models in modern empirical settings.

Finally, it is worth noting that the proposed EM estimator targets the maximum–likelihood solution of the pogit model, and therefore, inherits the usual asymptotic properties of likelihood–based inference under standard regularity conditions. While Bayesian Polya-Gamma formulations naturally provide finite-sample posterior uncertainty, the present approach relies on asymptotic inference based on the observed Fisher information, yielding a simple and scalable frequentist alternative suited to large datasets.

Bibliography30

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Cameron A.C. Trivedi P.K. Regression Analysis of Count Data 2nd ed.Cambridge University Press New York, NY, USA 2013
2Glady N. Croux C. Predicting customer wallet without survey data J. Serv. Res.20091121923110.1177/1094670508328983 · doi ↗
3Stoner O. Economou T. Drummond Marques da Silva G. A hierarchical framework for correcting under-reporting in count data J. Am. Stat. Assoc.20191141481149210.1080/01621459.2019.1573732 · doi ↗
4Papadopoulos G. Immigration status and property crime: An application of estimators for underreported outcomes IZA J. Migr.201431210.1186/2193-9039-3-12 · doi ↗
5Polettini S. Arima S. Martino S. An investigation of models for under-reporting in the analysis of violence against women in Italy Soc. Indic. Res.20241751007102610.1007/s 11205-023-03225-3 · doi ↗
6Bradshaw C. Blei D.M. A Bayesian model of underreporting for sexual assault on college campuses Ann. Appl. Stat.2024183146316410.1214/24-AOAS 1928 · doi ↗
7Winkelmann R. Zimmermann K.F. Poisson-Logistic Regression Working Paper 93–18Department of Economics, University of Munich Munich, Germany 1993
8Dvořák M. Wagner H. Sparse Bayesian modelling of underreported count data Stat. Model.201516244610.1177/1471082 X 15588398 · doi ↗