Prepaid parameter estimation without likelihoods
Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, Francis, Tuerlinckx

TL;DR
This paper introduces a prepaid database approach for statistical inference that precomputes model outcomes, enabling rapid and accurate parameter estimation across various models and priors, significantly outperforming existing methods.
Contribution
The authors develop a prepaid database method that allows fast, accurate inference for intractable models by precomputing and interpolating model outcomes, facilitating resource sharing among researchers.
Findings
Achieves 23,000 to 100,000-fold speed improvements.
Handles models previously considered quasi inestimable.
Demonstrates effectiveness on three challenging models.
Abstract
In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation…
| version | ||||||
|---|---|---|---|---|---|---|
| 1 | / | 0.17 | 0.67 | 7.45 | 0.74 | |
| 1 | 100000 | 0.17 | 0.66 | 7.49 | 0.7 | |
| 1 | 100000 | 0.16 | 0.63 | 7.9 | 0.7 | |
| 1 | 500000 | 0.16 | 0.62 | 8.17 | 0.7 | |
| 1000 | 100000 | 0.07 | 0.35 | 6.41 | 0.61 | |
| 1000 | 500000 | 0.05 | 0.27 | 4.83 | 0.48 | |
| 1000 | 100000 | 0.03 | 0.23 | 5.24 | 0.42 | |
| 1000 | 500000 | 0.03 | 0.21 | 4.39 | 0.4 |
| r | |||
|---|---|---|---|
| 1.2 | 0.021 | 0.14 | |
| 0.43 | 0.0044 | 0.023 | |
| 0.54 | 0.013 | 0.091 |
| time | 716 s | 3549 s | 5841 s | (50000 s) | (500000 s) |
| times faster | 16273 | 80659 | 132750 | (1000000) | (10000000) |
| times faster | 194 | 959 | 1578 | (10000) | (100000) |
| r | ||||
|---|---|---|---|---|
| 0.9 | 0.89 | 0.93 | ||
| 0.94 | 0.92 | 0.94 | ||
| 0.92 | 0.91 | 0.92 | ||
| 0.95 | 0.84 | 0.97 | ||
| prepaid | 0.96 | 0.94 | 0.96 | |
| 0.97 | 0.95 | 0.97 |
| r | Time (in seconds) | |||
|---|---|---|---|---|
| 1.05 (1.01– 1.1) | 0.41 (0.31 – 0.51) | 248.17 (139.53 – 493.2) | 830 | |
| 1.10 (1.06– 1.34) | 0.43 (0.30 – 0.54) | 140.60 (43.94 – 208.19) | 0.2 | |
| 1.06 (1.01– 1.24) | 0.41 (0.21 – 0.56) | 176.15 (19.27 – 427.65) | 4 |
| estimated with | estimated with | estimated with | |||||||
|---|---|---|---|---|---|---|---|---|---|
| parameter | r | r | r | ||||||
| test set created with | 8.2 | 0.13 | 0.53 | 10 | 0.12 | 0.82 | 16 | 0.17 | 0.94 |
| test set created with | 10 | 0.13 | 0.55 | 6.5 | 0.072 | 0.43 | 11 | 0.12 | 0.60 |
| test set created with | 4.4 | 0.15 | 0.33 | 6.9 | 0.19 | 0.51 | 3.5 | 0.065 | 0.28 |
| prior | r | ||
|---|---|---|---|
| flat prior | 88 | 0.17 | 0.42 |
| prior Equation 12 | 61 | 0.11 | 0.36 |
| version | ||||||
|---|---|---|---|---|---|---|
| 1 | / | 0.11 | 0.45 | 1.4 | 0.45 | |
| 1 | 100000 | 0.1 | 0.39 | 0.96 | 0.38 | |
| 1 | 100000 | 0.1 | 0.4 | 1 | 0.4 | |
| 1 | 500000 | 0.1 | 0.38 | 1 | 0.39 | |
| 1000 | 100000 | 0.03 | 0.14 | 0.39 | 0.32 | |
| 1000 | 500000 | 0.02 | 0.09 | 0.27 | 0.22 | |
| 1000 | 100000 | 0.02 | 0.07 | 0.18 | 0.14 | |
| 1000 | 500000 | 0.01 | 0.07 | 0.17 | 0.15 |
| version | ||||||
|---|---|---|---|---|---|---|
| 1 | / | 0.97 | 0.97 | 0.99 | 0.96 | |
| 1 | 100000 | 0.84 | 0.87 | 0.86 | 0.86 | |
| 1 | 100000 | 0.94 | 0.95 | 0.95 | 0.94 | |
| 1 | 500000 | 0.94 | 0.95 | 0.94 | 0.94 | |
| 1000 | 100000 | 0.27 | 0.3 | 0.29 | 0.27 | |
| 1000 | 500000 | 0.47 | 0.5 | 0.48 | 0.48 | |
| 1000 | 100000 | 0.93 | 0.94 | 0.96 | 0.93 | |
| 1000 | 500000 | 0.96 | 0.95 | 0.96 | 0.95 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Prepaid parameter estimation without likelihoods
Merijn Mestdagh
KU Leuven – University of Leuven, Belgium
These authors contributed equally to this work.
Stijn Verdonck
KU Leuven – University of Leuven, Belgium
These authors contributed equally to this work.
Kristof Meers
KU Leuven – University of Leuven, Belgium
Tim Loossens
KU Leuven – University of Leuven, Belgium
Francis Tuerlinckx
KU Leuven – University of Leuven, Belgium
Abstract
In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed up) and accuracy, and is able to handle previously quasi inestimable models.
1 Author Summary
Interesting nonlinear models are often analytically intractable. As a result, statistical inference has to rely on massive, time-intensive, simulations. The main idea of our method is to avoid the redundancy of similar computations that typically occur when different researchers independently fit the same model to their particular dataset. Instead, we propose to pool computational resources across the researchers interested in any given model. The prepaid method starts with an extensive simulation of datasets across the parameter space. The simulated data are compressed into summary statistics, and the relation to the parameters is learned using machine learning techniques. This results in a parameter estimation machine that produces accurate estimates very quickly (a 23,000 to 100,000-fold speed up compared to traditional methods).
2 Introduction
Models without an analytical likelihood are increasingly used in various disciplines, such as genetics [2], ecology [31, 6], economics [19, 7] and neuroscience [28]. For models without an analytical or easily computable likelihood, parameter estimation is a major challenge for which a variety of solutions have been proposed [31, 2, 12]. All these methods have in common that they rely on extensive Monte Carlo simulations and that their convergence can be painstakingly slow. As a result, the current methods can be very time consuming.
To date, the practice is to analyze each data set separately. However, considering all the calculations that have ever been performed during parameter estimation of a particular type of model, for each different data set, one cannot help but notice an incredible waste of resources. Indeed, simulations performed while estimating one data set may also be relevant for the estimation of another. Currently, each researcher estimating the same model with different data will start from scratch, and not benefit from all the possibly relevant calculations that have already been performed in earlier analyses by other researchers, in other locations, on different hardware, and for other data sets, but concerning the same model.
Therefore, our proposal is to combine resources and create a giant online database of parameter values coupled to simulated data. Consequently, (cloud based) interpolation techniques and global optimization methods can be used on the previously created (hence, prepaid) database for accurate and fast parameter estimation on any device. Statistical analyses that currently take up hours to days of computation time on dedicated hardware are now available to everyone within seconds.
In Figure 1 we present a graphical illustration of the prepaid parameter estimation method. First (panel A), for a representative number of parameter vectors , large data sets are simulated, compressed into summary statistics (i.e., ) and saved — creating the prepaid grid. This prepaid grid is computed beforehand and the results are stored at a central location. Second (panel B1), the observed (data) summary statistics () are compared to the simulated (data) summary statistics (i.e., ) using an appropriate objective loss function and a number of nearest neighbor simulated summary statistics are selected. The loss function is related to the loss function used in the generalized method of moments [10] and method of simulated moments [8].
Third (panel B2), interpolation methods are used to find the relation between the parameter values and the summary statistics for the selected points of the previous step [9, 20]. In this paper, we use tuned least squares support vector machines, LS-SVM [25]. Finally (panel B3), the objective loss function , now using predicted summary statistics , is minimized as a function of the unknown parameter values using an optimizer.
Three important aspects of the prepaid method deserve special mention. First, the parameter space is required to be bounded. If this is unnatural for a given parametrization, then the parameters have to be appropriately transformed to a bounded space. Second, we typically start from a uniform distribution of parameter vectors in the final parameter space. This choice reflects on the uniformity of the grid’s resolution, but has no further implications provided the grid is sufficiently dense. Bayesian priors can be implemented without recreating the prepaid grid, since the prior can be taken into account in the loss function. Third, often a user is not interested in a single instance of a model, but rather has data from several experimental conditions that share some common parameters but assume other ones to be different. Also in these cases the prepaid grid does not need to be recreated, as the parameter constraints can be included through priors with tuning parameters (i.e., penalties).
The performance of the prepaid method can be studied theoretically in simple situations (see Methods 5.1). In what follows, the prepaid method will be applied to three more complicated, realistic scenarios.
3 Results
Example 1: The Ricker model
In a first example, we apply our prepaid method to the Ricker model [27, 31] which describes the dynamics of the number of individuals in a species over time (with to ):
[TABLE]
where . The variables (i.e., the expected number of individuals at time ) and are hidden states. Given an observed time series , we want to estimate the parameters , where is the growth rate, the process noise and a scaling parameter. The Ricker model can demonstrate near-chaotic or chaotic behavior and no explicit likelihood formula is available.
Wood [31] used the synthetic likelihood to estimate the model’s parameters. In the original synthetic likelihood approach (denoted as ), the assumed multivariate normal distribution of the summary statistics is used to create a synthetic likelihood. The mean and covariance matrix of this normal distribution are functions of the unknown parameters and are calculated using a large number of model simulations. The synthetic likelihood is proportional to the posterior distribution from which is sampled using MCMC and a posterior mean is computed.
Wood’s synthetic likelihood approach is compared to the prepaid method, where we create a prepaid grid of the mean and the covariance matrix of a similar set of summary statistics. Prepaid estimation comes in multiple variants, depending on the use of an interpolation method. The first, which uses only the prepaid grid points and chooses the nearest neighbor (maximum synthetic likelihood) as final estimate, will be called . The second, , uses LS-SVM to interpolate between the parameters in the prepaid grid to increase accuracy. The differential evolution algorithm (a global optimizer; [24]) is used to maximize this interpolated synthetic (log)likelihood. Additional details on the implementation of the synthetic likelihood can also be found in Methods 5.2.
Figure 2 shows both the accuracy of parameter recovery (as measured with the RMSE) and computation time for the three methods under comparison: (1) as in [31], the prepaid method (2) with (), and (3) without () interpolation. As can be seen in Figure 2, the prepaid estimation techniques lead to better results than the synthetic likelihood for , both in accuracy and speed. The method leads to some clear outliers (see Methods 5.2 ) which testifies to possible convergence problems (probably due to local minima). The prepaid method suffers much less from this problem. Most striking is the speed up of the prepaid method: The version of the prepaid estimation is finished before a single iteration of the 30,000 iterations in the synthetic likelihood method has been completed — 100,000 times faster. In addition, it is demonstrated that the coverages of the prepaid method confidence intervals are very close or exactly equal to the nominal value (we look at 95% bootstrap-based confidence intervals). SVM interpolation is mainly helpful for large , where one expects a higher accuracy of the estimates and the grid is too coarse. The analyses with large could only be completed in a reasonable time using the prepaid method (See Methods 5.2 for more detailed information).
In the application above, the tacitly assumed prior on the parameter space is uniform. In addition, there is only one data set for which a single triplet of parameters needs to be estimated. In Methods 5.2, we show how both limitations can be relaxed. First, it is explained how different priors for the Ricker model can be implemented. Second, it is discussed what can be done if there are two data sets (i.e., conditions) for which it holds that and but and are not related.
Finally, we also tested our estimation process on the population dynamics of the Chilo partellus, extracted from Figure 1 in Taneja and Leuschner [26, 32]. Here we found that (95% confidence interval 1.06– 1.34), (95% confidence interval 0.30 – 0.54) and (95% confidence interval = 43.94 – 208.19). We found similar results using the synthetic likelihood method (see Methods 5.2), but our estimation was 4000 times faster.
Example 2: A stochastic model of community dynamics
A second example we use to illustrate the prepaid inference method is a trait model of community dynamics [14] used to model the dispersion of species. For this model (see also Methods section), there are four parameters to be estimated: , , , and . As with the first application, there is no analytical expression for the likelihood [14].
As an established benchmark procedure for this trait model, we apply the widely used Approximate Bayesian Computation (ABC) method [4, 30, 23, 1] as implemented in the Easy ABC package and denoted here as (PM stands for posterior means, which will be used as point estimates) [16]. As priors, we use uniform distributions on bounded intervals for , , and (see Methods 5.3 for the exact specifications), but this can be easily changed as explained for the first example.
To allow for a direct comparison with the ABC method, and to illustrate the versatility of the prepaid method, we have also implemented three Bayesian versions of the prepaid method. The first, , creates a posterior proportional to the prepaid synthetic likelihood. The second, , saves not only, the mean and covariance matrix of the summary statistics for every parameter in the prepaid grid, but also a large set of uncompressed summary statistics. Using these statistics we are able to approximate an ABC approach. The third, , again interpolates between the grid points to achieve a higher accuracy.
All methods result in accuracies of the same order of magnitude as can be seen in Table 1. The main difference is again the speed of the methods: is about 23,000 times faster than traditional ABC. For small sample sizes, all ABC based methods achieve good coverage. However, for large sample sizes, cannot be used anymore (because of the unduly long computation time). For the prepaid versions, it is necessary to use SVM interpolation between the grid points to get accurate results.
Example 3: The Leaky Competing Accumulator for choice response times
In a third example, we apply our method to stochastic accumulation models for elementary decision making. In this paradigm, a person has to choose, as quickly and accurately as possible, the correct response given a stimulus (e.g., is a collection of points moving to the left or to the right). Task difficulty is manipulated by applying different levels of stimulus ambiguity.
A popular neurally inspired model of decision making is the Leaky Competing Accumulator (LCA[29]). For two response options, two noisy evidence accumulators (stochastic differential equations, see Methods section) race each other until one of them reaches the required amount of evidence for the corresponding option to be chosen. The time that is required to reach that option’s threshold is interpreted as the associated choice response time. For different levels of stimulus difficulty, the model produces different levels of accuracy and choice response time distributions. The evidence accumulation process leading up to these choices and response times is assumed to be indicative of the activation levels of neural populations involved in the decision making.
As in the first two examples, there is no analytical likelihood available that can be used to estimate the parameters of the LCA. Moreover, the LCA is an extremely difficult model to estimate. To the best of our knowledge, only [21] systematically investigated the recovery of the LCA parameters, but for a slightly different model (with three choice options) and with a method that is impractically slow for very large sample sizes, making it difficult to show near-asymptotic recovery properties with.
For an experiment with four stimulus difficulty levels, the LCA model has nine parameters. However, after a reparametrization of the model (but without a reduction in complexity), it is possible to reduce the prepaid space to four dimensions (see Methods 5.4) and conditionally estimate the remaining subset of the parameters with a less computationally intensive method. Three variants of the prepaid method have been implemented: taking the nearest neighboring parameter set (based on a symmetrized distance between distributions) on the prepaid grid (), averaging over the grids nearest neighboring parameter sets of 100 non-parametric bootstrap samples (), using SVM interpolation for every bootstrap estimate (). A nearest neighbor or bootstrap averaged estimate completes in about a second on a Dell Precision T3600 (4 cores at 3.60GHz), an SVM interpolated estimate requires a couple of minutes extra.
Figure 3 displays the mean absolute error (MAE) of the estimates for four of the nine parameters as a function of sample size, separately for three estimation methods. The results for the other parameters are similar and can be consulted in the Methods section. It can be seen that with increasing sample size, MAE decreases. The SVM method pays off especially for larger samples. Figure 4 shows detailed recovery scatter plots for a subset of the parameters for 1,200 observed trials, which is the typical size of decision experiments. To get better recovery, larger sample sizes have to be considered (see Methods section). In general, recovery is much better than what has been reported in [21]. The coverage of the method, based on non-parametric bootstrapping, is satisfactory for all sample sizes, provided SVM interpolated estimates are used for . In addition, we do not find evidence for a fundamental identification issue with the two option LCA, as has been stated in [21].
4 Discussion
In three examples, we have demonstrated the efficacy and versatility of the prepaid method. The prepaid method is at least as accurate as current methods, but many times faster (23,000 to 100,000-fold speed up). Besides the improvements at the level of speed and accuracy, the prepaid method has a number of other distinct advantages. First, the prepaid method can be used for a very large number of observations, contrary to the synthetic likelihood or ABC methods. The use of very large simulated data sets allows investigation of large-sample properties of the estimator, which is a problem for the synthetic likelihood and ABC. Second, because of the enormous speed improvement and having data sets available across the whole parameter space, the prepaid method allows for fast yet extensive testing of recovery of simulated data across this space — the recovery of every single parameter set can be evaluated. Such a practice leads to detailed internal quality control of the used estimation algorithm.
Although the idea behind the prepaid method is fairly simple, we want to anticipate a few misconceptions that might arise. First, as has been demonstrated in the context of the Ricker model, the prepaid method can easily deal with different priors and with equality constraints on parameters, without the need to recreate the underlying prepaid grid. Second, the observed data based on which the model parameters have to be estimated can be of any size, again without the need to recreate the prepaid grid for each and every sample size.
Ideally, the prepaid databases and the corresponding estimation algorithms will be constructed and made available by a team of experts for the model at hand. Subsequently, a cloud based service can then be set up to offer high quality model estimations to a broad public of researchers. As an example, we created such a service for the Ricker model in Equation 1: www.prepaidestimation.org, where we allow the user to estimate the parameters of the Ricker model for personal data as well as 4 example data sets including one real life data set [26, 32]. By using such a cloud based service, researchers that need their data analyzed with computationally challenging models, can avoid many of the pitfalls they would otherwise encounter venturing out on their own. This practice will also lead to increased reproducibility of computational results.
A first possible objection against the prepaid method is the considerable initial simulation cost (for the examples discussed, prepaid simulations took up to a couple of days on a 20-core processor). However, this overhead cost will dissipate entirely as increasingly more estimates are sourced from the same prepaid database. Moreover, the initial prepaid cost can be easily distributed across multiple interested parties. Further, because the database can be used for internal quality control, additional simulation studies investigating the recovery of parameters are made redundant.
A second possible objection is that the prepaid grid, unsurprisingly, does not escape the curse of dimensionality: The grid size grows exponentially with the number of parameters. The prepaid method is most effective for highly nonlinear models with substantively meaningful parameters, as they appear in various computational modeling fields. Thus, the number of parameters cannot be very large. However, this limitation can be alleviated in a number of ways. First, the use of interpolation techniques allows for a substantial reduction of the number of prepaid points (by a factor of five for the same accuracy in the trait model example; see Methods section). Second, as is shown in the LCA example, it is possible to only partially apply the prepaid method, estimating with normal techniques, the less challenging parameters conditionally on a prepaid grid of the more intricately connected ones. Third, as shown by tackling three challenging examples, current storage and throughput possibilities can accommodate realistically sized prepaid databases.
It is our strong belief that this method will massively democratize the use of many computationally expensive models, which are now reserved for people with access to specific high-end hardware (e.g., GPUs, HPC). Apart from such democratization, this approach could significantly impact the current work flow of scientific modeling, in which every part of the estimation is carried out locally by an individual researcher.
5 Methods
5.1 A toy example: Estimating the mean of a normal
For a very simple setting, we want to study the performance of the prepaid methods analytically.
Assume () with the mean unknown (and to be estimated and the standard deviation known (so number of parameters ). The observed mean is denoted as . We will explore two situations. In the first situation, will be our summary statistic (hence number of summary statistics ) to estimate ( is also a sufficient statistic for ). In the second situation, we will study what happens if is chosen to be the summary statistic.
Situation 1:
As a prepaid grid, we take evenly spaced -values with spacing or gap size . For each value , values of are simulated and the sample average is computed (i.e., ). Typically, or larger. Hence, every value of is paired with a particular : .
Given an observed , the nearest neighbors of simulated statistics are selected: , , , , such that . Typically, .
Because of the linearity of the problem, we can safely assume that if is large enough, the selected values are all consecutive or nearly consecutive (because of noise in the prepaid simulation of , it can happen that the selected values are not consecutive). We denote the average of these -values as . If all values are exactly consecutive, can be expressed as
[TABLE]
where we have defined as
[TABLE]
In addition (assuming that all values are exactly consecutive), their variance is given by
[TABLE]
Hence, their standard deviation is .
Using the pairs, we assume as a linear interpolator in this example a linear regression model that links the simulated statistics to the true underlying : , with . Obviously, and .
Given , selected prepaid points and the fitted linear regression model, we know from linear regression theory that:
[TABLE]
where 0 and 1 are the true and and
[TABLE]
The distribution is assumed to hold for repeated simulations of the replicated statistics in the prepaid grid.
Because we work with linear regression, the optimization problem is simple. In this case, the optimal value of for a given can be found by inverting the regression line:
[TABLE]
Next, we can study the properties of . We begin by calculating the conditional mean and conditional variance . Hence, we treat the observed data (or sample average) as given and fixed. These expectations are taken over different simulations of ’s in the prepaid grid. Before giving the expressions, it is useful to note that
[TABLE]
Now, using the approximations given in [22] for ratios of random variables, we find that:
[TABLE]
and
[TABLE]
Invoking the double expectation theorem to arrive at the unconditional expectations, we have:
[TABLE]
where , that is, the difference between the true value and the mean of the selected nearest neighbors ’s. Likewise, we can derive:
[TABLE]
From Equation 2, we learn that if there is no systematic deviation in the selection of -grid points, the prepaid estimator is unbiased. In the other case, the is bias decreases with but is proportional to . In Equation 3, the leading term of the variance is , which is the same as in classical estimation theory. For the other terms, they all have (or a power of it) in the denominator. Because is usually quite large, these terms tend to be in general of lesser importance. However, some terms also have both (the number of selected nearest neighbor grid points) and (the gap size). It is worthwhile to note that increasing the resolution (i.e., decreasing ), while keeping constant, will increase the additional terms and thus add to the error. The reason for this is that the interpolation is defined on a too small grid, leading to uncertainty in the estimated regression. This effect is illustrated in the left panel of Figure 5 in which the root mean square error (RMSE) is shown for the estimation of for different values of and . The plot is constructed by means of a simulation study, but confirms our analytical results.
Situation 2:
In the second situation, we will again estimate (the unknown mean of a unit variance normal), but in this case is used as a statistic. Thus, the relation between the simulated statistics and is quadratic (and thus nonlinear). Again we use a local linear approximation. Clearly, this approximation will only be approximately valid if we do not choose the area of approximation too large. However, unlike in the first situation, we do expect an additional effect of the approximation error.
No analytical derivations were made for this case, but we conducted a similar simulation study as in situation 1. The results (in terms of RMSE) are shown in the right panel of Figure 5. As can be seen, there is a clear optimality trade-off visible between and . This can be explained as follows: Fix and then consider the gap size . If is too small, we get a similar phenomenon as in the left panel, that is a large RMSE. However, if we take too large, then the approximation error will dominate (because the linear interpolation misfits the quadratic relation). The optimal point will be different for different .
This toy example demonstrates the sound theoretical foundations of the prepaid method in well-behaved situations. However, the question is how well the method performs for real life examples. Three hard problems will be studied next.
5.2 Application 1: The Ricker model
The basic model equations of the Ricker model is given in Equation 1.
Synthetic likelihood estimation
For the synthetic likelihood estimation (), we made use of the synlik package [5]. The synthetic likelihood for a data set with summary statistics and a certain parameter vector is given by
[TABLE]
where and are the estimated mean and covariance of the summary statistics when Equation 1 is simulated multiple times with parameter .
The statistics used by the synthetic likelihood function were the average population size, the number of zeros, the autocovariances up to lag 5, the coefficients of the quadratic linear autoregression of and the coefficients of the cubic regression of the ordered differences on the observed values.
For each data set we used the synthetic likelihood Markov chain Monte Carlo (MCMC) method with 30000 iterations, a burn in of 3 time steps and 500 simulations to compute each and [5]. We used the following prior:
[TABLE]
The synlik package generates the MCMC chain on a logarithmic scale, we estimated the parameters as the exponential of the posterior mean. To ensure convergence, only the last half of the chain is used (the last 15000 iterations).
Creation of the prepaid grid
For the prepaid estimation, we used the same summary statistics as for the traditional synthetic likelihood, except for two differences. First, the coefficients of the cubic regression of the ordered differences on the observed values could not be used, because the observed values are not available when creating the prepaid grid. Second, we changed the number of zeros to the percentage of zeros to make this statistic independent of (as this may change depending on the observation).
We filled the prepaid grid with 100000 parameter sets using the priors of Equation 5. To cover this grid as evenly as possible (and avoiding too large gaps), the uniform distribution was approximated using Halton sequences [18, 17]. For each parameter set in the prepaid grid, we simulated a time series of length and used the summary statistics of this long time series as .
Each time series was then split into series of length 100,1000 and 10000 which were used to compute the covariance for the statistics computed on data of these lengths. This means, for example, that we had 100000 series of length 100 to compute the covariance matrix for a certain parameter set for time series of length 100. If we need to estimate parameters of a time series with not equal to one of the lengths, we use the covariance matrix created with time series of length which is closest to in logarithmic scale and adapt the covariance matrix into
[TABLE]
The creation of the prepaid grid took approximately one day on a 3.4GHz 20-core processor.
To allow the estimation for a bigger range of parameters for the online estimation at www.prepaidestimation.org we created a new and bigger prepaid grid using the following priors:
[TABLE]
We filled to prepaid grid with 100000 parameter sets and used this prior for the real life data set.
Prepaid estimation
Four variants of prepaid estimation were implemented for this example. All use the negative synthetic likelihood as distance ( as defined in the main text and Figure 1). First, we do a nearest neighbor estimation , without using any interpolation between the grid points of the prepaid data set. We compute the synthetic likelihood of all the prepaid parameters for the summary statistics of the test data set. The parameter vector with the highest likelihood, the so-called nearest neighbor may already be a good estimation. For a low number of time points , it is to be expected that the error on the parameter estimate is much larger than the gaps in the prepaid grid, and in such a case, the estimation approach suffices.
Second, a more accurate estimation can be acquired by interpolating between the parameter values in the prepaid grid (). Therefore, we learn the relation between the parameters and the summary statistics: . However, we only learn this relation in the region of interest, that is the 100 nearest neighbors according to the synthetic likelihood. For each summary statistic, we create, on the fly, a separate least squares support vector machine (LS-SVM) [25] using the 100 nearest neighbors. This machine learning technique is chosen as it is a fast non-linear method which generalizes well. We limit the predictions to the possible range of the summary statistics (e.g., to prevent a percentage of zeros, one of the statistics, larger than 1).
We then use the differential evolution global optimizer [24] to find the maximum of:
[TABLE]
where is the covariance matrix of the statistics of the nearest neighbor as defined in Equation 6. The superscript "PP" is used to denote that we use the prepaid version of synthetic likelihood, and not the traditional version as used by [31] (see Equation 4). The optimization process is constrained and we use the minima and maxima for each parameter of the 100 nearest neighbors as effective bounds.
The approach makes use of a non-linear black box interpolator. However, we may also consider using a much faster linear regression (see also the toy example in Section 5.1). Therefore, we will also compare the (and ) approach to a third option where we predict the summary statistics using a linear regression (called the approach).
Third, we can easily implement a prior for the likelihood in Equation 4. This leads to a posterior given by
[TABLE]
The parameters will be estimated as the maximum a posteriori (MAP), as comparison to maximum likelihood estimation which is a maximum a posteriori with a uniform prior. Here we will apply this extension to the nearest neighbor estimation: .
Lastly we will show that our prepaid method can also be used to cover multiple experimental set ups. Each experimental set up involves multiple conditions and may have varying constraints on the parameters of the model over these conditions. If, for one experimental set up, the conditions are independent, the likelihood of the whole experiment is
[TABLE]
where is the synthetic likelihood for condition . This is equivalent to estimating each parameter set individually for each condition . If the conditions are not independent and we assume that some parameters are the same over conditions we adapt the prior to mimic these assumptions. For example, if we assume that the first parameter is the same over all conditions we formulate this as
[TABLE]
where is the standard normal distribution and is the average of all . The smaller the tuning parameter , the more all will be forced to be equal. If is too large the estimation will not take into account the interdependence between the conditions. However, if it is too small we run into trouble with the sparsity of the prepaid grid. In the limit, where goes to zero, one point is chosen from the prepaid grid leading to equal parameters not only for first parameter but also for the others. can be easily tuned by simulating the experimental set up for a certain prepaid grid.
To further illustrate, we will apply this method to the Ricker model, assuming two conditions across which and stay the same such that this prior is given by
[TABLE]
We first use the nearest neighbor approach to find the 1000 nearest neighbors for condition one and two separately and then we refine the parameters using prior 12. As we assume and to be constant over the conditions, we take and as final estimates for these parameters in each condition.
Test set
As a test set we first used 100 random parameters created with the prior of Equation 5. To avoid problems with the borders we deleted parameters that where within 1% range of the bounds. We simulated data sets for . For each data set we estimated parameters using the nearest neighbor () and the approach. For , we also estimated the parameters using the approach. Due to time constraints, we only estimated parameters for the data with using the traditional synthetic likelihood approach.
Next we also created test data sets from different priors for . Prior from Equation 5 can also be written as
[TABLE]
where is a beta distribution with parameters and . Similarly, we created a test set from prior
[TABLE]
and prior
[TABLE]
We will test if performs best when the correct prior is used in the estimation process. Last we also created a test set for for an experimental set up with two conditions where and are equal over the conditions.
Results
For the results, we will evaluate the methods on the following criteria: accuracy, speed, and coverage.
Accuracy
To start off, we look at the recoveries for for all 100 simulated data sets and the three methods (, and ). Scatter plots are shown in Figure 6. It can seen that the synthetic likelihood estimation leads to some clear outliers. One possible reason for the absence of outliers in the prepaid estimation is the fact that prepaid estimation from the start examines the whole grid and therefore has less problems with getting stuck in local optima.
More generally, we plotted the accuracy of each of the methods as a function of time series length in Figure 7. The left panel shows the root mean square error (RMSE), while the right panel shows the median absolute error (MAE). We decided to look at the MAE because the few outliers for (which were shown Figure 6) may inflate the RMSE of the synthetic likelihood disproportionally, which happens to a certain extent. However, very similar conclusions can be drawn for both performance measures. In general, accuracy increases when increases (i.e., both RMSE and MAE decreases). For RMSE, our SVM prepaid method clearly outperforms the traditional synthetic likelihood method for every and every parameter. For , also the prepaid approach leads for every parameter to a lower RMSE compared to the synthetic likelihood. For all , the prepaid leads to a higher accuracy compared to the prepaid and this difference becomes larger for a larger . For MAE, the prepaid method and the original synthetic likelihood show a very similar accuracy (for ). Both outperform the prepaid.
The largest attainable accuracy for the prepaid approach is limited by the spacing of the prepaid grid. If we had created an equally spaced grid of points using the prior in Equation 5, we would have the following gaps in each of the three parameter dimensions:
[TABLE]
We do not have an equally spaced grid, but it is expected that the quasi Monte Carlo distribution of points creates expected gaps close to the ones in Equation 16. Therefore, it is no coincidence that the best possible RMSE using the prepaid approach has the same order of magnitude as the gap size , as can be seen in Table 2 for the case of . However, Table 2 also show that the prepaid approach leads to a much lower RMSE. The difference between the and the prepaid approach for is further visualized in Figure 8.
The results in Table 2 also show the need for a non-linear interpolator for the prepaid method. The RMSE of a linear regression interpolator () is much larger than that of the SVM prepaid.
In sum, we can conclude that the prepaid estimation methods lead to better, or at least similar, results as the traditional synthetic likelihood.
Speed
The largest improvement of the prepaid method over synthetic likelihood is in computational speed: The prepaid method is many times faster than synthetic likelihood. Consider Figure 2 in the main text where it is shown that the prepaid method is finished before a single iteration of the 30000 iterations are done by the method. While the and the prepaid methods are finished in respectively 0.044 and 3.7 seconds, independent of the time series length , the method grows slower with an order of magnitude of . In each iteration one needs to simulate multiple time series with length . The larger , the slower the estimation. While the synthetic likelihood needs approximately one and a half hour to estimate the parameters for a time series with length . The prepaid estimation still finishes in 0.044 s, which is more than times faster. The speed up factors are presented in Table 3 and as can be seen from Figure 7, there is not loss of accuracy. The speed up would reach millions, if we had the time to run the synthetic likelihood method for longer time series.
Coverage
Next, we look at the coverage rates of the confidence intervals as obtained with the bootstrap in combination with the prepaid method. To estimate a confidence interval of the estimates for the prepaid method, a parametric bootstrap with bootstrap samples was used.
For the prepaid version the estimate for the observed data set was obtained using the approach and the bootstrap estimates were commonly obtained using the prepaid method applied to the bootstrap data sets. However, if in the first 100 bootstraps only half of the nearest neighbors where unique points, the bootstrap distribution could be considered questionable. This behavior is to be expected for larger sample sizes , because the true bootstrap distribution is very peaked so that every bootstrap sample will have the same nearest neighbor grid point. When this occurs, we would estimate the parameters of each bootstrap using differential evolution, using the SVM created by the original 100 nearest neighbors.
Alternatively, for the synthetic likelihood approach (using MCMC) we computed the confidence interval by calculating the 0.025 and 0.975 quantiles of the last half of the posterior samples.
The coverage results for the test set of 100 parameters are shown for three different values of in Table 4. It can be seen that for both methods, the coverage is close to the nominal level of , but the coverage of the prepaid method is slightly better.
Prior
In this paragraph we show how we can benefit from using the correct prior. We estimate the parameters of the three testsets for , created with uniform prior from Equation 9 and beta distribution priors and from Equations 14 and 15. We estimated all three data sets using maximum a posteriori estimation using all three priors. The results are shown in Table 6. Using the correct prior leads, as expected, to the best results.
Parameter constraints across conditions
We estimated the parameters for a two condition experimental set up with equal and , with and without the prior from Equation 12 (parameter was tuned on 100 similar simulated data sets). The results are shown in Table 7. Using the prior from Equation 12, which implements the parameter constraints of the experimental set up, leads, as expected, to better results for each parameter. Even for , which is absent in the prior, we find better results.
Real life data set
The results for the estimation of the population dynamics of the Chilo partellus [32, 26], using the prior from Equation 7 can be found in Table 5. For the prepaid, we estimated the parameters using the methods online at www.prepaidestimation.org. All estimations are similar and have overlapping confidence intervals. The prepaid estimation is however significantly faster.
5.3 Application 2: A stochastic model of community dynamics
A second model we will apply our prepaid modeling technique to, is a stochastic dispersal-limited trait-based model of community dynamics [14]. The data that will be modeled, are the abundances of species (hence a vector of frequencies, in which each component is a different species). Each species in the local environment is assumed to have a competitive value dependent on its trait , given by the filtering function
[TABLE]
Here is the maximal competitive advantage, is the optimal trait value in the local environment and describes the width of the filtering function. At each time step, one individual from the local community dies. It is then replaced with a probability by a random descendant from the local pool. Here, is the size of the local community and is the fourth parameter to estimate, related to the amount of immigration from the regional pool into the local community. The probability that this descendant comes from a certain individual in the local community, is proportional to the competitiveness of this individual. With a probability of , the dead individual is replaced by an immigrant from the regional pool. The distribution of traits of the individuals in the regional pool is assumed to be uniform over . It is noteworthy that Jabot saw the necessity of reusing ABC simulations to reduce computation time in his recovery study [14].
The model was simulated using the C++ code from the Easy ABC package [16] where a regional pool of species was defined evenly spaced on the trait axis (i.e., the resolution) and was the size of the local community.
ABC estimation
We compare our prepaid method estimation with the Easy ABC package () [15, 16]. Because we work in a Bayesian framework, we first have to specify priors. As in Jabot et al. we use the following priors [16]:
[TABLE]
In this application, the parameter vector is defined as follows: . To get the ABC algorithm to work, we compute four summary statistics: the richness of the community (number of living species), Shannon’s index which measures the entropy of the community, and the mean and the skewness of the trait distribution of the community.
The ABC algorithm we use applies a sequential parameter sampling scheme [3]. The sequence of tolerance bounds is given by and the algorithm proceeds to the next tolerance after 200 simulations which lead to summary statistics within the bounds. The last 200 simulations within the bounds represent the posterior, and the estimate of the parameter is given by the posterior mean.
Creation of the prepaid grid
For the prepaid estimation, we used exactly the same summary statistics as the Easy ABC package. We filled the prepaid grid with parameter vectors using the priors of Equation 18, but for most examples we will use a grid with only parameter vectors. To cover this grid as evenly as possible, the uniform distribution was approximated using Halton sequences [18, 17] (in order to avoid gaps that may appear when Monte Carlo samples are used). The creation of the prepaid grid with parameter vectors took approximately 3 days on a 3.4GHz 20-core processor.
For the community dynamics models from Equations 17 and 18, there are several ways to simulate an almost infinitely large data set to achieve stable summary statistics. The first way is to increase the number of species and the size of the local pool . Unfortunately some summary statistics (the richness and the entropy) are in some unknown way dependent on these parameters. As a result, the summary statistics of a simulation with cannot be used to estimate the parameters for a setting where . Therefore, we chose to fix the size of the local pool and the number of species . It is very well possible that there are summary statistics which do not have this problem, making the prepaid grid much more universal. We chose however, for the sake of comparison with the easy ABC package to keep using these parameters.
A second way to simulate data with a very large sample size is by increasing the number of time steps. By estimating the summary statistics after each time step, when one individual from the local community dies and is replaced by another individual, we create a time series of summary statistics. Averaging the summary statistics over a sufficient large number of time points will lead to stable average values of these summary statistics. In our simulations, we applied some tinning by calculating the summary statistics every time after 500 species have died (the size of the community). The reasons is that there is not enough of variation in the summary statistics computed after the death of a single species. Next, we created time series of length ( species will have been replaced) for the prepaid grid and used the average of these summary statistics as . Using this time series we also computed for . is of course the setting for which the original trait model is described and for which the Easy ABC algorithm is tested. Additionally we also saved 1000 samples of time series of length .
Prepaid estimation
Contrary to the first application (the Ricker model), where we used a frequentist approach, for this community dynamics model we will follow a Bayesian approach. In Bayesian statistics, the focus is on the posterior distribution of the parameters , which is defined as follows:
[TABLE]
where is the likelihood and the prior. As the likelihood, we will use the synthetic likelihood , where is the synthetic log-likelihood as defined in Equation 4 (based on the vector of summary statistics ). Because we compress the data into summary statistics, the posterior we work with is actually an approximation to the true posterior: (in case the summary statistics are sufficient statistics for , the approximation sign becomes an equality sign). The distributions from Equation 18 are the priors for the parameters.
We have studied three variants of a Bayesian version of the prepaid method. These three versions will be discussed here in increasing order of complexity. We will denote the three variants as follows: , , and .
Because the priors are all uniform (and our prepaid grid is distributed following this prior), the posterior for a data set with summary statistic at parameter of the prepaid grid is proportional to
[TABLE]
where is the prepaid synthetic likelihood (i.e., with the mean statistics computed for a very large sample and a approximate covariance matrix given by Equation 6). The posterior mean (PM) using prepaid synthetic likelihood can be estimated as:
[TABLE]
The prepaid synthetic likelihood approach works best if the assumption of normally distributed summary statistics is not too far off. However, as can be seen in Figure 9, this is not always the case for the trait model defined in Equation 17. Therefore, as an alternative procedure, we propose an Approximate Bayesian Computation (ABC) approach. First, we select a subset of nearest neighbors from the prepaid set, such that for every , the synthetic likelihood value is highest and so that
[TABLE]
where the sum in the denominator runs across all grid points. In a sense, these are all the prepaid points in the expected coverage according to the posterior of Equation 20. We denote the cardinality of as .
In a next step, we basically perform ABC with all the grid points belonging to the selected subset . However, there is an important issue we cannot overlook. When doing ABC, for a given parameter vector new data are simulated of the same size as the observed data. Unfortunately, our prepaid grid has correspondingly only very large data sets. To rectify this problem, so that ABC can applied without problems, we simulated during the construction of the prepaid grid, a set of prepaid samples for several designated sample sizes (i.e., ). Let us denote with the vector of statistics for prepaid grid point , the th simulation (with ) and sample size .
Now, we can apply ABC to arrive at the posterior for ; the method will be denoted as . For now we will assume that is equal to one of the lenghts. We select the 1000 samples from this samples set that have the smallest Mahalonobis distance to the observed set of statistics :
[TABLE]
here is given by the covariance over all grid points in and over all 1000 replications (thus, ). The finally selected 1000 samples are then considered as a sample from the posterior. Note that the method does not require us to progressively strengthen the tolerances, as in traditional (governed by the tolerance parameter ). If the observed sample size is not equal to one of the lengths, then one can use the samples for length which is closest to in logaritmic scale and later adjust the posterior samples such that the posterior mean stays the same, but the posterior covariance matrix changes to
[TABLE]
We advise to save samples for enough different such that this correction is only marginal.
The is only based on the raw prepaid grid points. But again, a more accurate estimation can be found by interpolating between the parameters in the prepaid grid. Therefore, we learn the relation between the parameters and the summary statistics using LS-SVM: . We only learn this relation in the region of interest, that is, only the 100 nearest neighbors according to the approach or more specifically, the 100 prepaid points for which the most samples lead to a small enough .
Before we use machine learning to infer the relation we cluster these 100 nearest neighbors using hierarchical clustering such that no cluster has more than 50 prepaid points. This is necessary as these 100 nearest neighbors may come from totally different areas in the prepaid grid. This is illustrated in Figure 10.
For each cluster, we first make sure that at least 20 points are included (if not, we add points from the prepaid grid which are closest). Then we estimate the using LS-SVM for each cluster separately, giving rise to . Next, we find the minimum volume ellipse encompassing all the points in each cluster. These ellipses inform us about the areas for which the relation holds. Subsequently we resample parameters in each ellipse to zoom in more and more to the regions of interests. In detail, we do the following in every cluster :
Uniformly sample 1000 points in the minimum volume ellipse for cluster . We create a finer grid for each elipse. 2. 2.
Find the summary statistics based on the LS-SVM in cluster : 3. 3.
Find for each point the nearest point from the prepaid points with which this particular cluster was created 4. 4.
Translate the 1000 samples from the nearest point to the newly sampled point and add to each sample the difference in summary statistics: . In this step we aproximate a distribution of statistics for around . 5. 5.
Keep the points for which from Equation 23 is among the 5000 smallest distances and remove all others. 6. 6.
Recalculate the minimum volume ellipse with the new points. 7. 7.
Go back to step 1, until the worst does not decrease any more.
Broadly speaking, in step 1, we sample parameters , in step 2 to 4 we approximate the summary statistics distribution for each using LS-SVM and in step 5 to 7 we trim this set of parameters to only keep the parameters with a high posterior probability.
In the end we combine all the samples, we build the posterior with the parameters from the 1000 best samples over all clusters according to Equation 23. Note that some parameters may show up several times in this posterior sample. To compute the posterior mean, we use a weighted version of these samples. The weights are given by the volume of the ellipse from the cluster where they were created. This is necessary to ensure the correct use of the equal prior for all clusters.
Test set
To generate the test set, we follow the same logic as in [14]. We use the prior in Equation 18 to generate 1000 random parameter sets, except for , where we changed the prior with the following generating distribution:
[TABLE]
such that 0 and 100 are the true minimum and maximum optimal trait values for communities. By taking the prior for as in Equation 18, we avoid boundary effects. To exclude other problems at the borders of the parameter space, we deleted parameters which where within 1% range of the bounds. We simulated data sets for both and .
Results
Accuracy
Let us first look at the results for . We have used traditional ABC (), prepaid Bayes approach based on the synthetic likelihood () and prepaid ABC based on separately generated samples at the grid points ( and ). We have used and prepaid grid points. The RMSE and MAE can be found in Tables 1 and 8. All methods result in accuracies that are equally large. For 3 out of 4 parameters (except for ), the prepaid method outperforms with respect to RMSE. For MAE, the prepaid method uniformly outperforms the Easy ABC package (). Overal, the difference between and prepaid grid point is very small for the prepaid methods.
We have refrained from interpolating with the LS-SVM because the coverage includes on average more than 1000 points. This is perfectly logical because does not provide a lot of information, and, as a consequence, there is a lot of uncertainty (which translates itself into a large number of parameter points that have a reasonable large synthetic likelihood value). As a result, creating a posterior based on only 100 nearest neighbors (even after interpolation) would not suffice because too many parameter points with high posterior density would be missed.
For (see again Tables 1 and 8), the accuracy increases, as is expected (this can be seen both in the RMSE as in the MAE). In this case, both increasing the number of grid points and using LS-SVM interpolation increases accuracy. No results are given for , because it is impossible to fit the model with this sample size in acceptable time.
Speed
For , the estimation time of is 3865 s. In contrast, the estimation time of is 0.167 s. This means that the prepaid ABC method is approximately 23000 times faster than traditional ABC.
Coverage
For both the as well as the prepaid versions we end up with a posterior sample. We computed the coverage by calculating the 0.025 and 0.975 quantiles of the posterior samples. Next, we checked whether the true parameter was in this interval or not. Note that when we use clustering during , we weigh each point proportional to the volume of its originating cluster. For the approach we use the whole prepaid set as posterior and us weights according to Equation 20.
For and , coverage results can be found in Table 9. For , leads to better coverages than . Also the method gives good coverages (around the nominal level of 0.95) for , but these coverages deteriorate for if no interpolation is used (coverage is a bit better for grid points). When the LS-SVM interpolation is applied (i.e., ), coverages become very good again, certainly for the largest number of grid points.
5.4 Application 3: The Leaky Competing Accumulator
Elementary decision making has been studied intensively in humans and animals [13]. A common example of an experimental paradigm is the random-motion dot task: the participant has to decide whether a collection of dots (of which only a fraction moves coherently; the others move randomly) is moving to the left or to the right. The stimuli typically have varying levels of difficulty, determined by the fraction of dots moving coherently.
Assuming there are two response options (e.g., left and right), the Leaky Competing Accumulator consists of two evidence accumulators, and (where denotes the time), each associated with one response option. The evolution of evidence across time for a single trial is then described by the following system of two stochastic differential equations:
[TABLE]
where and are uncorrelated white noise processes. To avoid negative values, the evidence is set to 0 whenever it becomes negative: and . The initial values (at ) are .
The evidence accumulation process continues until one of the accumulators crosses a boundary (with ). The coordinate that crosses its decision boundary first, determines the choice that is made and the time of crossing is seen as the decision time. The observed choice response time is seen as the sum of the decision time and a non-decision time , to account for the time needed to encode the stimulus and emit the response.
Equation 26 describes the evolution of information accumulation for a two-option choice RT task, given the presentation of a single stimulus. For all stimuli, the total evidence is equal to , but the differential evidence for option 1 compared to 2 is , which is stimulus dependent and reflects the stimulus difficulty. In this example, we assume the stimuli can be categorized into four levels of difficulty, hence .
The model gives rise to two separate choice response time probability densities, and , each representing the response time conditional on the choice that was made. Integrating the densities over time will result in the probability of choosing the response options: \int_{0}^{\infty}p_{1i}(t)dt=\Pr(\mbox{option 1 for stimulus i}) and \int_{0}^{\infty}p_{2i}(t)dt=\Pr(\mbox{option 2 for stimulus i}). Obviously, when taken together, and sum to one.
All parameters in the parameter vector can take values from [math] to . This parametrization is known to have one redundant parameter [21], so we choose to fix .
The re-parametrization
The prepaid method will not be applied to the model as presented in Equation 26, but rather on a re-parametrized formulation:
[TABLE]
again with the additional restriction that and . The new parameters are defined as follows in terms of the original ones:
[TABLE]
This new parametrization has the advantage that can be interpreted as an inverse time scalar because doubling makes all choice response times twice as fast. This property will allow us to reduce the dimensionality of the prepaid grid (see below). The parameter denotes general stimulus strength scaled according to , while parameter (for coherence) denotes the amount of relative evidence encoded in the stimulus : . It is commonly assumed for these evidence accumulator models that different stimuli should lead to different coherences , but without affecting the other parameters. The nondecision time is not transformed.
Creation of the prepaid grid
For the delineation of the parameter space, we will follow the specifications of [21]. Because this parameter space is rather restrictive (a consequence of the recommendation of [21] to improve parameter recovery), we will extend it through the use of a time scale parameter. This extension will be further discussed when introducing the test set.
First, we create a prepaid grid on a four-dimensional space in the original parametrization by drawing from the following distribution:
[TABLE]
We select 10000 grid points from this distribution using Halton sequences [18, 17]. When working in the reparametrized version, as defined in Equation 27, this space can be transformed to a four dimensional space of , , and .
However, because acts an inverse time scalar on the response time distributions, we may also consider the three dimensional space formed by , , and and for each grid point, choose the parameter in such a way that the RT distributions for options 1 and 2 are scaled to fit nicely between 0 and 3 seconds (with a resolution of 1ms and 3000 time points so that about of the tail mass is allowed to be clipped at 3 seconds when ). Effectively, this brings all RT distributions to the same scale (denoted as ). This process of scaling is illustrated in Figure 11. It reduces both the number of simulations and the storage load (without it we would have to simulate and store a separate set of distributions for each value of ). Note that the scaling is done jointly for all RT distributions associated with a particular . The resulting diffusion constant corresponding to the rescaled distribution is denoted as . In addition, the construction effectively removes one parameter from the prepaid grid, which is illustrated in Figure 12.
To include the coherence parameter, we extend each grid point with a set of predefined coherences. For each point in the grid, we take 50 equally spaced coherences (with ) from [math] to the maximum coherence that still has some non-zero chance of choice option 2 to be selected (we take ). Finally, we simulate for each combination of and a large number of choice response time data (choices and response times). This is illustrated in Figure 11.
In a last step, grid points are eliminated from the prepaid grid, if the simulations result in too many simultaneous arrivals (i.e., trajectories that end at or very close to the intersection point of the two absorbing boundaries at the upper right corner, located at ). More specifically, we drop grid points with more than 0.1 percent simultaneous arrivals. Creating the prepaid database took less then a day on a NVIDIA GeForce GTX 780 GPU.
Prepaid estimation
To explain how the prepaid estimation of the LCA works, let us start with a prototypical experimental design. Assume a choice RT experiment with four stimulus difficulty levels (e.g., four coherences in the random dot motion task). Each difficulty level is administered times to a single participant. A particular trial in this experiment results in , where is the stimulus difficulty level () and is the sequence number within its difficulty level (). The data resulting from this experiment are responses (referring to choice or choice ) and response times . Each pair is considered to originate from an unknown parameter set and coherences ().
Our first aim is to is to establish a local net of prepaid points that lead to data that are close to the observed dataset. If necessary, we can further zoom in with the help of support vector machines. Conditional on each prepaid parameter set in the basic grid, a number of the remaining parameters can be integrated out beforehand. First, conditional on grid point , we have for 50 predetermined coherences simulated accuracies and response time distributions (see Figure 11). The coherences of the observed data can be estimated solely using the observed accuracies using simple linear interpolation. The estimated coherence for stimulus (or condition) is denoted as . Corresponding to each of the 50 coherences for grid point , there is a pair of corresponding simulated RT densities (with ). As before, is scaled to the seconds window, and we can use a combination of translating (estimating ), scaling (estimating ) and interpolating. Specifically, we first calculate as the optimal time scalar to match data with the model on grid point :
[TABLE]
in which
[TABLE]
This formula capitalizes on the fact that the variance of a distribution does not change when it is simply shifted to the right by a constant. Hence, the ratio of the model’s decision time variance (without ) and the observed total response time variance (presumably shifted with some ) is still an estimator of the squared scale factor between them. Using this information, we can estimate the optimal and for grid point as follows:
[TABLE]
with being the optimal scaling diffusion constant used for optimal storage in the database. This gives us a final effective parameter vector of . Note that the last 6 elements of this vector are estimates conditional on the grid point .
Next, we have to determine the single optimal parameter set (and thus also the optimal , , and ). For this we need an objective function that compares the model based PDFs with those of the data. For this purpose, we use a (symmetrized) chi-square distance based on a set of bin statistics. For each stimulus’ observed set of choice RTs, (with the RTs for option 1 and for option 2), we calculate 20 data quantiles (with ) at probability masses . The set of quantiles is appended with one extra quantile at to have a more detailed representation of the leading edge of the distribution. Based on binning edges , we create bin frequencies with . The corresponding probability masses can be easily extracted from the prepaid PDFs as well. Observed and theoretical quantities can then be combined in the a symmetrized chi-square distance:
[TABLE]
This defines a distance between all grid points in the database and any data set.
In the following paragraphs we will present three ways of using this distance to calculate LCA estimates, each a bit more complicated than the previous one (but also more accurate): , , .
The grid point closest to the data set (as measured by the symmetrized chi-square distance function) can be used as a first nearest neighbor estimate.
Not all parameters are treated equally in the estimation procedure. The parameters , and are estimated conditionally on all grid points and then the other parameters are estimated conditionally on , and . Moreover, these parameters are chosen in such a way that a specific aspect of the data (e.g., proportion of choices for option 1) is fitted perfectly (i.e., the coherence is chosen to result in probabilities perfectly equal to the proportions observed in the data). This would be no problem for an infinite amount of data. However, for finite data, the major disadvantage of this way of working is that any errors induced in the precursor step are propagated through the estimation process for , and . This is because for finite data, the observed accuracies will typically not exactly coincide with the accuracies provided by the best model estimates. As the estimates are (on each grid point) exactly fit to the observed accuracy and consequently, the effective grid points will all have this exact accuracy. We tackle this estimator bias by non-parametrically bootstrapping the data and repeating the nearest neighbor estimate for every bootstrapped dataset. Taking the mean of this set of estimates (a method known as bagging; [11]), gives us a more accurate estimate. Additionally, we now have a standard error of the estimate (and confidence interval).
If we apply the bootstrap procedure, it may turn out that the selected grid points as nearest neighbor are not very diverse (this may happen with large sample sizes). In such a situation, it can be worthwhile to use an interpolator. So we may learn a support vector machine based on the bin statistics of the few unique bootstraps grid points available, together with the best overall unique grid points. We propose to use a training set of 100 grid points in total. The SVM can then be used as an approximation for the bin statistics in the space between the grid points and hence for the objective function. We subsequently minimize the approximative SVM based objective function for every bootstrap, using differential evolution (as has been outlined above for the other applications).
Obviously, the quality of the SVM based estimate is limited by the quality of the SVMs that are trained to learn the relation between parameters and statistics. In addition, the same SVMs are used for all bootstrap samples, which may introduce an unwanted distortion in the uncertainty assessment. To account for the systemic bias that might have been introduced by the SVMs, we will add some random noise to each bootstrap estimate. The amount of random deviation that is added equals the size of the prediction error of the SVM. In this way, low quality SVMs are prohibited of biasing all bootstraps in the same way. The uncertainty of the SVMs is now incorporated in the final bootstrapped results.
Test set
The test set is created by uniformly sampling parameters according to Equation 28. Input differences are chosen to produce typical accuracies of 0.6, 0.7, 0.8, and 0.9. As is done in [21], excessively long PDFs (with a maximum RT larger than 5000ms) and excessively short PDFs (with a range below 400ms) are removed from the test set. Apart from the fact that these PDFs are deemed unrealistic [21] for typical choice RT data, this part of the parameter space suffers from inherent poor parameter identifiability, with very large confidence intervals and less meaningful estimates as a consequence. Because the new parametrization analytically integrates out scale (i.e., ) (and also shift ), and is positively unbounded in these dimensions, we can expand the test set to cover a broader range of distributions than the ones covered in [21]. To broaden the range of the test, the distributions are scaled with a random factor ranging from 0.2 to 5. We will use this broadened test set to determine the method’s accuracy and coverage.
Results
Accuracy
The recoveries of the original LCA parameters are displayed in Figures 4, 13, and 14. It can be concluded that for all sample sizes, recovery is acceptable, but it improves a lot for larger sample sizes. In all cases, the recovery is dramatically better than that reported in [21]. Figures 16 and 15 shows RMSE and MAE, respectively, as a function of sample size for three methods (for all parameters). It can be seen that accuracy improves for all parameters for the single best nearest neighbor and for the bootstrap method, until some point, after which it stabilizes or deteriorates. However, for the SVM based estimation, there is still considerable improvement for higher sample sizes.
Coverage
Figure 17 shows the coverages for different numbers of observations. Nearest neighbor bootstrap coverage seems to be adequate for sample sizes up to 10000; for higher sample sizes SVMs are needed to ensure good coverage.
Acknowledgements
The research leading to the results reported in this paper was sponsored in part by Belgian Federal Science Policy within the framework of the Interuniversity Attraction Poles program (IAP/P7/06), as well as by grant GOA/15/003 from the KU Leuven, and the grand for M.M and grant G.0806.13 from the Fund of Scientific Research Flanders.
Author contributions statement
M.M. and S.V. conceived the method; M.M., S.V., and K.M. implemented the method. T.L. and F.T. studied the method theoretically. M.M., S.V., and F.T. wrote the manuscript.
Additional information
The author(s) declare no competing interests.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Beaumont [2010] M. A. Beaumont. Approximate Bayesian Computation in Evolution and Ecology. Annual Review of Ecology, Evolution, and Systematics , 41(1):379–406, 2010. doi: 10.1146/annurev-ecolsys-102209-144621. URL https://doi.org/10.1146/annurev-ecolsys-102209-144621 .
- 2Beaumont et al. [2002] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate Bayesian Computation in Population Genetics. Genetics , 162(4):2025–2035, Dec. 2002. ISSN 0016-6731, 1943-2631. URL http://www.genetics.org/content/162/4/2025 .
- 3BEAUMONT et al. [2009] M. A. BEAUMONT, J.-M. CORNUET, J.-M. MARIN, and C. P. ROBERT. Adaptive approximate Bayesian computation. Biometrika , 96(4):983–990, 2009. ISSN 0006-3444. URL http://www.jstor.org/stable/27798882 .
- 4Csilléry et al. [2010] K. Csilléry, M. G. B. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution , 25(7):410–418, July 2010. ISSN 0169-5347. doi: 10.1016/j.tree.2010.04.001.
- 5Fasiolo and Wood [2014] M. Fasiolo and S. Wood. An introduction to synlik (2014). R package version 0.1.1. 2014. URL http://XXX.org .
- 6Fasiolo et al. [2016] M. Fasiolo, N. Pya, and S. N. Wood. A Comparison of Inferential Methods for Highly Nonlinear State Space Models in Ecology and Epidemiology. Statistical Science , 31(1):96–118, Feb. 2016. ISSN 0883-4237, 2168-8745. doi: 10.1214/15-STS 534. URL https://projecteuclid.org/euclid.ss/1455115916 .
- 7Fermanian and Salanié [2004] J.-D. Fermanian and B. Salanié. A NONPARAMETRIC SIMULATED MAXIMUM LIKELIHOOD ESTIMATION METHOD. Econometric Theory , 20(4):701–734, Aug. 2004. ISSN 1469-4360, 0266-4666. doi: 10.1017/S 0266466604204054. URL https://www.cambridge.org/core/journals/econometric-theory/article/nonparametric-simulated-maximum-likelihood-estimation-method/7E 6CC 1D 3F 7274578 AF 6C 175400 DBC 1F 8 .
- 8Gourieroux and Monfort [1996] C. Gourieroux and A. Monfort. Simulation-based econometric methods . Oxford University Press, 1996.
