The Nested_fit data analysis program

Martino Trassinelli (INSP; INSP-E10)

arXiv:1907.12259·physics.atom-ph·January 30, 2024

The Nested_fit data analysis program

Martino Trassinelli (INSP, INSP-E10)

PDF

TL;DR

Nested_fit is a Bayesian data analysis tool utilizing nested sampling, designed for atomic spectra and physical data, offering model comparison and parameter estimation with a versatile spectral profile database.

Contribution

It introduces Nested_fit, a novel Bayesian analysis program with an upgraded sampling method and a flexible spectral profile database for physical data investigations.

Findings

01

Provides Bayesian evidence for model comparison

02

Offers detailed parameter probability distributions

03

Includes a large, expandable spectral profile database

Abstract

We present here Nested_fit, a Bayesian data analysis code developed for investigations of atomic spectra and other physical data. It is based on the nested sampling algorithm with the implementation of an upgraded lawn mower robot method for finding new live points. For a given data set and a chosen model, the program provides the Bayesian evidence, for the comparison of different hypotheses/models, and the different parameter probability distributions. A large database of spectral profiles is already available (Gaussian, Lorentz, Voigt, Log-normal, etc.) and additional ones can easily added. It is written in Fortran, for an optimized parallel computation, and it is accompanied by a Python library for the results visualization.

Tables1

Table 1. Table 1: Summary of the results provided from Nested_fit for the two considered models. The parameter values are given in terms of most probable value and 95% confidential interval (CI).

	Model 1	Model 2
\midruleFunction	$y = N_{0} e^{- t / τ}$	$y = N_{0} e^{- t / τ} [1 + a \sin (ω t + ϕ)]$
\midrulelog_e(Evidence)	$- 1594.11 \pm 0.30$	$- 1594.60 \pm 0.36$
Probability	34.2–41.9%	58.1–68.8%
Complexity	2.05	15.19
Extracted information [nat]	4.76	6.32
$ω$ (CI 95%) [rad s^-1]	–	$0.89 (0.17 - 6.86)$
$a$ (CI 95%)	–	$9.2 \times 10^{- 2} (2.2 \times 10^{- 4} - 7.2 \times 10^{- 2})$
$ϕ$ (CI 95%) [rad]	–	$3.84 (0.18 - 6.14)$
\bottomrule

Equations10

P (M ∣ {x_{i}, y_{i}}, I) \propto P ({x_{i}, y_{i}} ∣ M, I) \times P (M ∣ I),

P (M ∣ {x_{i}, y_{i}}, I) \propto P ({x_{i}, y_{i}} ∣ M, I) \times P (M ∣ I),

E (M) \equiv P ({x_{i}, y_{i}} ∣ M, I) = \int P ({x_{i}, y_{i}} ∣ a, M, I) P (a ∣ M, I) d^{J} a = \int L^{M} (a) P (a ∣ M, I) d^{J} a,

E (M) \equiv P ({x_{i}, y_{i}} ∣ M, I) = \int P ({x_{i}, y_{i}} ∣ a, M, I) P (a ∣ M, I) d^{J} a = \int L^{M} (a) P (a ∣ M, I) d^{J} a,

E (M) = \int_{0}^{1} L (X) d X,

E (M) = \int_{0}^{1} L (X) d X,

X (L) = \int_{L (a) > L} P (a ∣ I) d^{J} a .

X (L) = \int_{L (a) > L} P (a ∣ I) d^{J} a .

E (M) \approx m \sum L_{m} Δ X_{m},

E (M) \approx m \sum L_{m} Δ X_{m},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

The Nested_fit data analysis program

M. Trassinelli

[email protected]

Institut des NanoSciences de Paris, INSP, CNRS, Sorbonne Université, F-75005 Paris, France

Abstract

We present here Nested_fit, a Bayesian data analysis code developed for investigations of atomic spectra and other physical data. It is based on the nested sampling algorithm with the implementation of an upgraded lawn mower robot method for finding new live points. For a given data set and a chosen model, the program provides the Bayesian evidence, for the comparison of different hypotheses/models, and the different parameter probability distributions. A large database of spectral profiles is already available (Gaussian, Lorentz, Voigt, Log-normal, etc.) and additional ones can easily added. It is written in Fortran, for an optimized parallel computation, and it is accompanied by a Python library for the results visualization.

I Introduction

Nested_fit is a general purpose parallelized data analysis code for the evaluation of Bayesian evidence and parameter probability distributions for given data sets and modeling function. The computation of the Bayesian evidence is based on the nested sampling algorithm Skilling (2004); Sivia and Skilling (2006); Skilling (2006), for the integration of the likelihood function over the parameter space. This integration is obtained reducing the $J$ -dimensional volume (where $J$ is the number of parameters) in a one-dimensional integral by a clever exploration of the parameter space. In Nested_fit, this exploration is obtained with a search algorithm for new parameter values called lawn mower robot, which has been initially developed by L. Simons Theisen (2013) and modified here for a better exploration of multimodal problems.

Nested_fit has been developed over the past years to analyze several sets of experimental data from, mainly, atomic physics experiments. For this reason, it has some special feature well adapted to the analysis of atomic spectra as specific line profiles, possibility to study correlated spectra at the same time, eg. background and signal-plus-background spectra, and with a likelihood function built considering a Poisson statistics per each channel, well adapted to low-statistics data.

In the next section we will describe the general structure and feautres of Nested_fit. In Sec. III we shortly introduce the basic concepts of Bayesian model comparison and the nested sampling method. The specific algorithm for the parameter space exploration for the nested sampling is presented in details in Sec. IV. An example of application of Nested_fit is presented in Sec. V for the analysis of single two-body electron capture ion decay. A conclusive section will end the article, where recent application of Nested_fit to different atomic physics analysis are mentioned.

II General structure of the program

The general structure of the program is represented in Fig. 1. The main input files are two: the file nf_input.dat, where all computation input parameters are included, and the data file, which name is indicated in the parameter input file. The function name in the input file indicates the model to be used for the calculation of the likelihood function. Several functions are already defined in the function library for modelling spectral lines: Gaussian, Log-normal, Lorentzian, Voigt (Gaussian and Lorentzian convolution), Gaussian convoluted with an exponential (for asymmetric peaks), etc. Additional functions can be easily defined by the users in the dedicated routine (USERFNC). Differently from the version presented in Ref. Trassinelli (2017) (V. 0.7), in the new version discussed here (V. 2.2) non-analytical or simulated profile models can be implemented. In this case, one or more additional files have to be provided by the users. These external data, which can have some noise like the case of simulated data, are interpolated by B-splines using FITPACK routines Dierckx (1995). The B-spline parameters are stored and used as profile/model with the total amplitude and a possible offset as free parameters. An additional feature of this new program version, is the possibility to analyze data with error bars. This option has to be indicated in the input file.

Several data sets can be analyzed at the same time by selecting the option “set of data: YES”. This is particularly important for the correct study of physically correlated spectra at the same time, eg. background and signal-plus-background spectra. This is done using a global user-defined function with common parameters of specific models for each spectrum. In the case of multiple data files, the program read an additional input parameter file nf_input_set.dat for the additional datafile names to analyze and data ranges to consider.

The exploration of the parameter space and the corresponding evaluation of the likelihood function is done implementing the nested sampling algorithm Skilling (2004); Sivia and Skilling (2006); Skilling (2006). If the data are in the format $(channel,counts)$ , a Poisson distribution for each channel is assumed for the likelihood function. If the data has error bars $(channel,y,\delta y)$ , a Gaussian distribution is assumed (new feature in V. 2.2).

The main analysis results are summarized in the output file nf_output_res.dat . Here the details of the computation (n. of live points, n. of trials, n. of total iteration) can be found as well as the final evidence value and its uncertainty $E\pm\delta E$ , the parameter values $\boldsymbol{\hat{a}}$ corresponding to the maximum of the likelihood function, the mean, the median, the standard deviation and the confidence intervals (68%, 95% and 99%) of the posterior probability distribution of each parameter. The information gain $\mathcal{H}$ and the Bayesian complexity $\mathcal{C}$ are also provided in the output.

Data for plots and for further analyses are provided in the files nf_output_data_.dat. These files contain the original input data together with the model function values corresponding to the parameters with the highest likelihood function value (nf_output_data_max.dat) or the parameter mean value (nf_output_data_mean.dat) or median value (nf_output_data_median.dat) with the corresponding residuals and error bars. Additional nf_output_fit_.dat files contain a model evaluation with higher density than the original data for graphical presentation purpose.

The step-by-step details of the nested sampling exploration are provided in the file nf_output_points.dat that contains the live points used during the parameter space exploration, their associated likelihood values and posterior probabilities. From this file, the different parameter probability distributions and joint probabilities can be built from the marginalization of the unretained parameters. For this purpose, a special dedicated Phython library Nested_res has been developed. Additional informations can be found in Ref. Trassinelli (2017).

III Implementation of the nested sampling for the evidence calculation

For a given data set(s) $\{x_{i},y_{i}\}$ and model(s) $\mathcal{M}$ , the Bayesian evidence $P(\{x_{i},y_{i}\}|\mathcal{M},I)$ is extracted for the evaluation of the probability to the different models them-selves:

[TABLE]

where $P(\mathcal{M}|I)$ is the prior probability of each model (assumed constant if not specific preferences for the model is present) and $I$ indicates the background information. The Bayesian evidence is the integral value of the likelihood function over the entire parameter space defined by the priors $P(\boldsymbol{a}|\mathcal{M},I)$ :

[TABLE]

where $J$ is the number of the parameters of the considered model, and where we explicitly show the dependency of likelihood function $L^{\mathcal{M}}(\boldsymbol{a})$ on the model $\mathcal{M}$ .

The calculation of the Bayesian evidence is made with the nested sampling, similarly to other available codes Sivia and Skilling (2006); Mukherjee et al. (2006); Feroz and Hobson (2008); Feroz et al. (2009); Veitch and Vecchio (2010). Nested sampling allows for reducing the above integral in the one-dimensional integral

[TABLE]

where $X$ is defined by the relation

[TABLE]

Eq. (3) can be numerically calculated using the rectangle integration method subdividing the $[0,1]$ interval in $M+1$ segments with an ensemble $\{X_{m}\}$ of $M$ ordered points $0<X_{M}<...<X_{2}<X_{1}<X_{0}=1$ . We have then

[TABLE]

where $\mathcal{L}_{m}=\mathcal{L}(X_{m})$ and $\Delta X_{m}=X_{m}-X_{m+1}$ . The evaluation of $\mathcal{L}_{m}$ is obtained by the exploration of the likelihood function with a Monte Carlo sampling via a subsequence of steps. For this, we use a collection of $K$ parameter values $\{\boldsymbol{a}_{k}\}$ that we call live points. More details on the nested sampling algorithm and its implementation can be found in Refs. Skilling (2004, 2006); Sivia and Skilling (2006); Mukherjee et al. (2006); Feroz and Hobson (2008); Feroz et al. (2009); Veitch and Vecchio (2010). The specific implementation of nested sampling in Nested_fit is presented in details in Ref. Trassinelli (2017).

The bottleneck of the nested sampling algorithm is the search of new points within the $J$ -dimensional volume defined by $L>\mathcal{L}_{m}$ . Different methods are commonly employed to accomplish this difficult task. One efficient method is the ellipsoidal nested sampling Mukherjee et al. (2006). It is based on the approximation of the iso-likelihood contour defined by $L=\mathcal{L}_{m}$ by a $J$ -dimensional ellipsoid calculated from the covariance matrix of the live points. The new point is then selected within the ellipsoidal volume (with an enlargement factor selected by the user). This method, well adapted for unimodal posterior distribution has also been extended to multimodal problems Feroz and Hobson (2008); Feroz et al. (2009), i.e. with the presence of distinguished regions of the parameter space with high values of the likelihood function. Other search algorithms are based on Markov chain Monte Carlo (MCMC) methods Veitch and Vecchio (2010) and the recent Galilean Monte Carlo Skilling (2012); Feroz and Skilling (2013), particularly adapted to explore the regions close to the boundary of $V_{L>\mathcal{L}_{m}}$ volumes. Nested_fit program is based on an improved version of the lawn mower robot method, originally developed by L. Simons Theisen (2013) and presented in details in the next section.

IV The lawn mower robot search algorithm

A schematic view of the improved lawn mower robot algorithm is represented in Fig. 2. To cancel the correlation between the starting point and the final point, a series of $N$ jumps are made in this volume. The different stages of the algorithm are

Choose randomly a starting point $\boldsymbol{a}_{n=0}=\boldsymbol{a}_{0}$ from the available live points $\{\boldsymbol{a}_{m,k}\}$ as starting point of the Markov chain where $n$ is the number of the jump. The number of tries $n_{t}$ (see below) is set to zero. 2. 2.

From the values $\boldsymbol{a}_{n-1}$ , find a new parameter sets $\boldsymbol{a}_{n}$ where each $j^{\text{th}}$ parameter is calculated by $(a_{n})_{j}=(a_{n-1})_{j}+f\ r_{j}\sigma_{j}$ , where $\sigma_{j}$ is the standard deviation of the live points of the nested sampling computation step relative to the $j^{\text{th}}$ parameter, $r_{j}\in[-1,1]$ is a sorted random number and $f$ is a factor defined by the user.

(a)

If $L(\boldsymbol{a}_{n})>\mathcal{L}_{m}$ and $n<N$ , go to the beginning of step 2 with an increment of the jump number $n=n+1$ . 2. (b)

If $L(\boldsymbol{a}_{n})>\mathcal{L}_{m}$ and $n=N$ , $\boldsymbol{a}_{n=N}$ is new live point to be included in the new set $\{\boldsymbol{a}_{m+1,k}\}$ . 3. (c)

If $L(\boldsymbol{a}_{n})<\mathcal{L}_{m}$ and $n<N$ and the number of tries $n_{t}$ is less than the maximum allowed number $N_{t}$ , go back to beginning of step 2 with an increment of the number of tries $n_{t}=n_{t}+1$ . 4. (d)

If $L(\boldsymbol{a}_{n})<\mathcal{L}_{m}$ and $n<N$ and $n_{t}=N_{t}$ a new parameter set $\boldsymbol{a}_{0}$ has to be selected. Instead than choosing one of the existing live points, $\boldsymbol{a}_{0}$ is built from distinct $j^{\text{th}}$ components from different live points: $(a_{0})_{j}=(a_{m,k})_{j}$ where $k$ is randomly chosen between 1 and $K$ for each $j$ . Then $\boldsymbol{a}_{n=0}=\boldsymbol{a}_{0}$ and go to the beginning of step 2.

Step 2c, the main improvement of the original lawn mower robot algorithm, makes the algorithm well adapted to problems with multimodal parameter distributions allowing easy jump between high-likelihood regions. The value of $N_{t}$ is fixed in the code ( $N_{t}=10000$ in the present version). The other parameters can be provided by the input file.

V An application to low-statistics data

To show the capabilities of Nested_fit, we present in this section its implementation on a particular critical case corresponding to a debated experiment. In 2008 it was observed an unexpected modulation in the two-body electron capture decay of single H-like ${}^{142}_{~{}61}$ Pm ions to the stable ${}^{142}_{~{}60}$ Nd bare nucleus, with a monochromatic electron-neutrino emission Litvinov et al. (2008). The same modulation frequency, but with much smaller amplitude, was found in 2010 data Kienle et al. (2013) but not in the latest campaign in 2014 Ozturk and et al. (2019) where much more events have been recorded.

The unstable ions are produced by collision with a solid target and then injected in a storage ring where they are cooled down. In the storage ring, the decay time of single ions is measured from changes of the Schottky noise frequency induced by the ion revolution. The H-like ${}^{142}_{~{}61}$ Pm ion and ${}^{142}_{~{}60}$ Nd bare nucleus masses correspond in fact to different revolution frequencies. From the accumulated data of single decay events, the decay probability per unit of time can be measured. An example of the data collected in 2010 is presented in Fig. 3 (up).

The observed modulation of the expected exponential decay has not yet a clear explanation. A possible connection with neutrino masses differences is speculated in the literature. The determination of the presence or not of a modulation is a perfect case for implementing Bayesian model comparison with Nested_fit.

When a possible modulation of the exponential decay is assumed, the likelihood function corresponding to 2010 data presents several maxima. This reflect the periodicity nature of the considered function, which can manifest itself via different harmonics, and the low number of available counts per channel. The difficulties to deal with these multiple likelihood maxima pushed in fact the creation of the improved lawn mower robot algorithm.

In Fig. 3 (top) we present the collected data together with the exponential and modulated exponential functions corresponding to the most probable parameter set. The output result from Nested_fit are presented in Tab. 1 where model 1 and 2 represent the absence of presence of modulation. For each model, values of the evidence, Bayesian complexity and extracted information are provided, as well as model probabilities. The uncertainty of the probabilities is related to the uncertainty of the evidence. As example of probability distribution, we present in Fig. 3 (bottom) the joint probability of the amplitude $a$ and pulsation $\omega$ of the modulation in model 2. The 2D histogram (obtained with Python nested_res.py library that accompany Nested_fit program) is constructed by marginalization on the other parameters. As it can be seen, different maxima are visible, which make difficult the convergence of the nested sampling method. The improved lawn mower robot algorithm can deal with this kind of situation, even if the computation time is sometime long (several days in a single CPU).

As it can be observed, the assigned probability to each model are similar and the confidential intervals for the parameter relative to the modulation model are very large. These two aspects reflect the difficulty to treat this problem where the acquired data are not sufficient to provide a marked preference for one model with or without modulation (see Ref. Ozturk and et al. (2019) for a more extended discussion). Even if apparently unsatisfying, this result avoid however possible over-interpretation of the data commonly encountered when classical methods are employed, as recently discussed in Ref. King et al. (2019) in the context of nuclear physics.

VI Conclusions

We presented here the program Nested_fit, a general purpose parallelized data analysis code for the evaluation of Bayesian evidence and other statistically relevant outputs. It uses the nested sampling method with the implementation of the improved lawn mower robot algorithm for the evaluation of the Bayesian evidence. Nested_fit has been developed over the past years for the analysis of several sets of atomic experimental data that strongly contribute to the code evolution. We would like to mention in particular the analysis of low-statistics X-ray spectra of He-like uranium Trassinelli et al. (2009); Trassinelli (2017), X-ray spectra of pionic atoms Trassinelli et al. (2016a, b), electron photoemission spectra from nano-particles Papagiannouli et al. (2018); Villa et al. (2019), single-ion decay spectra Ozturk and et al. (2019) and response function of crystal X-ray spectrometers (in progress).

Compared to the version reported in Ref. Trassinelli (2017), the presented version (V. 2.2) shows additional important features: i) the possibility to interpolate and use computed or simulated external profiles and ii) the implementation of Gaussian likelihood function for data with error bars.

Future developments of Nested_fit will be focussed on the implementation of new exploration methods for the live point evolution of the nested sampling Feroz and Hobson (2008); Feroz et al. (2009); Skilling (2012); Feroz and Skilling (2013). More precisely, the main goal is the improvement the efficiency for the exploration of the parameter space where the likelihood function presents several local maxima.

\funding

This research received no external funding.

Acknowledgements.

The author thanks the Alexander von Humboldt Foundation that provide the financing to attend the MaxEnt2019 conference. Moreover, the author would like to express once again his deep gratitude to Leopold M. Simons who introduced the author to the Bayesian data analysis and without whom this work could not have been started. The development of this program would not be possible without the close interactions and discussions with many collaborators that the author would like to thank as well: N. Winckler, R. Grisenti, A. Lévy, D. Gotta, Y. Litvinov, J. Machado, N. Paul and all members of the Pionic Hydrogen, FOCAL and GSI Oscillation collaborations and the ASUR group at INSP. \conflictsofinterestThe authors declare no conflict of interest. \reftitleReferences

Bibliography21

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Skilling (2004) Skilling, J. Nested Sampling. AIP Conf. Proc. 2004 , 735 , 395–405.
2Sivia and Skilling (2006) Sivia, D.S.; Skilling, J. Data analysis: a Bayesian tutorial , second ed.; Oxford University Press, 2006.
3Skilling (2006) Skilling, J. Nested sampling for general Bayesian computation. Bayesian Anal. 2006 , 1 , 833–859.
4Theisen (2013) Theisen, M. Analyse der Linienform von Röntgenübergängen nach der Bayesmethode. Diplomarbeit, Fakultät für Mathematik, Informatik uns Naturwisssenschaften der RWTH Aachen, 2013.
5Trassinelli (2017) Trassinelli, M. Bayesian data analysis tools for atomic physics. Nucl. Instrum. Methods B 2017 , 408 , 301–312.
6Dierckx (1995) Dierckx, P. Curve and surface fitting with splines ; Oxford University Press, 1995.
7Mukherjee et al. (2006) Mukherjee, P.; Parkinson, D.; Liddle, A.R. A Nested Sampling Algorithm for Cosmological Model Selection. Astrophys. J. Lett. 2006 , 638 , L 51.
8Feroz and Hobson (2008) Feroz, F.; Hobson, M.P. Multimodal nested sampling: an efficient and robust alternative to Markov Chain Monte Carlo methods for astronomical data analyses. Mon. Not. R. Astron. Soc. 2008 , 384 , 449–463.