Algorithm for distance list extraction from pair distribution functions

Ran Gu; Soham Banerjee; Qiang Du; and Simon J. L. Billinge

arXiv:1901.07185·cond-mat.mtrl-sci·January 23, 2019


TL;DR

This paper introduces an automated algorithm that accurately extracts atomic distance lists from pair distribution functions using curve fitting and innovative initialization techniques, applicable to nanostructured samples and similar spectral data.

Contribution

The paper presents a novel, automated algorithm for extracting distance lists from PDFs, incorporating a new initialization approach to handle non-convex optimization challenges.

Findings

1. Effective initial guess improves extraction accuracy

2. Algorithm performs well on nanostructured samples

3. Potential extension to Gaussian-sum spectra

Abstract

We present an algorithm to extract the distance list from atomic pair distribution functions (PDFs) in a highly automated way. The algorithm is constructed via curve fitting based on a Debye scattering equation model. Due to the non-convex nature of the resulting optimization problem, a number of techniques are developed to overcome various computational difficulties. A key ingredient is a new approach to obtain a reasonable initial guess based on the theoretical properties of the mathematical model. Tests on various nanostructured samples show the effectiveness of the initial guess and the accuracy and overall good performance of the extraction algorithm. This approach could be extended to any spectrum that is approximated as a sum of Gaussian functions.

Tables

Table 2: True and extracted peak parameters for the simulated PDF of an 18-atom Lennard-Jones decahedron. Consecutive rows that repeat the same extracted values are unresolved doublets that the program returned as a single peak.

			$Q_{\max} = 30$ Å$^{-1}$						$Q_{\max} = 23$ Å$^{-1}$
True values			Initial guess			Final extraction			Final extraction
$r$ (Å)	$\sigma$ (Å)	$m$	$r$ (Å)	$\sigma$ (Å)	$m$	$r$ (Å)	$\sigma$ (Å)	$m$	$r$ (Å)	$\sigma$ (Å)	$m$
2.8921	0.1	47	2.9000	0.1040	57.5527	2.9008	0.1019	56.9680	2.9007	0.1023	57.3677
2.9443	0.1	10	2.9000	0.1040	57.5527	2.9008	0.1019	56.9680	2.9007	0.1023	57.3677
4.0525	0.1	5	4.1020	0.1100	15.2690	4.1027	0.1061	14.9729	4.1026	0.1038	14.7513
4.1271	0.1	10	4.1020	0.1100	15.2690	4.1027	0.1061	14.9729	4.1026	0.1038	14.7513
									4.4167	0.2925	1.7502
4.7640	0.1	15	4.7450	0.0800	14.1855	4.7647	0.1001	15.0521	4.7454	0.0891	11.9304
4.9787	0.1	10	5.0110	0.0880	19.8334	4.9942	0.1007	19.8774	4.9786	0.1128	23.6725
5.0092	0.1	10	5.0110	0.0880	19.8334	4.9942	0.1007	19.8774	4.9786	0.1128	23.6725
5.5732	0.1	30	5.5740	0.1045	31.0025	5.5755	0.1023	30.6557	5.5748	0.1036	31.4847
5.7841	0.1	1
									6.2230	0.2271	1.4548
6.7147	0.1	10	6.7120	0.0970	9.8322	6.7147	0.0996	9.9450	6.7168	0.1050	10.9121
									7.2787	0.1356	1.0148
7.7084	0.1	5	7.7190	0.1000	4.9749	7.7084	0.0993	4.9490	7.7079	0.1093	5.9554
									8.6454	0.1834	1.2553

Table 4: Interatomic distances from the Lopez-Acevedo (LA) structure model (ground truth) and optimized extracted peak parameters for data simulated from the LA model and for experimental PDF data from Au$_{144}$(SC6)$_{60}$ clusters. Where several extracted peaks fall within one LA distance range, the last column of each block gives the sum of their multiplicities.

LA model distances		Simulated extraction				Experimental extraction
$r$ (Å)	$m$	$r$ (Å)	$\sigma$ (Å)	$m$	sum of $m$	$r$ (Å)	$\sigma$ (Å)	$m$	sum of $m$
2.68-3.06	528.00	2.87	0.13	552.80		2.85	0.12	565.68
3.08-3.33	102.00	3.21	0.11	105.76		3.15	0.14	128.01
3.35-3.49	12.00	3.45	0.09	16.95		3.40	0.13	73.61
		3.67	0.08	7.66
3.82	1.00					3.79	0.12	38.66
3.87-4.20	179.00	4.03	0.13	198.15		4.03	0.11	181.19
4.30	1.00
4.40-4.47	40.00	4.44	0.12	28.49		4.41	0.09	58.26
4.49-4.80	214.00	4.64	0.16	220.81		4.68	0.14	270.63
4.82-5.25	468.00	4.98	0.16	497.69		5.00	0.14	415.99
5.26-5.54	387.00	5.41	0.13	399.73		5.38	0.11	363.49
5.55-5.81	369.00	5.66	0.13	355.96		5.64	0.13	349.33
5.83-6.11	105.00	5.93	0.16	132.43		5.91	0.15	126.89
6.13-6.25	43.00	6.19	0.09	30.64		6.19	0.16	29.33
6.27-6.37	14.00
6.40-6.85	442.00	6.62	0.16	444.94		6.47	0.12	113.53	395.06
6.40-6.85	442.00	6.62	0.16	444.94		6.66	0.13	281.53	395.06
6.87-6.96	26.00
6.98-7.81	1425.00	7.19	0.16	528.45	1335.57	7.09	0.13	254.19	1233.05
		7.49	0.16	721.79		7.33	0.15	496.30
		7.71	0.11	85.33		7.59	0.16	482.56

Equations

$$g(r)=\frac{1}{r}\,\frac{1}{N\langle f\rangle^{2}}\sum_{j\neq l}f_{j}^{\star}f_{l}\,\delta(r-r_{jl}). \tag{1}$$

$$G(r)=\frac{2}{\pi}\int_{0}^{\infty}F(Q)\sin(Qr)\,dQ, \tag{2}$$

$$F(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}f_{j}^{\star}f_{l}\,\frac{\sin(Qr_{jl})}{r_{jl}}. \tag{3}$$

$$G(r)=\frac{2}{\pi}\int_{Q_{\min}}^{Q_{\max}}F(Q)\sin(Qr)\,dQ. \tag{4}$$

$$G(r)\approx\frac{2}{\pi}\sum_{i=1}^{N_{Q}}F(Q_{i})\sin(Q_{i}r)\,\Delta Q_{i}, \tag{5}$$

$$G(r)=\frac{R(r)}{r}-4\pi\rho_{0}\gamma_{0}(r)\,r \tag{6}$$

$$4\pi\rho_{0}\gamma_{0}(r)\,r=\frac{2}{\pi}\int_{0}^{Q_{\min}}F(Q)\sin(Qr)\,dQ$$

$$F(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}f_{j}^{\star}f_{l}\left(e^{-\frac{1}{2}\sigma_{jl}^{2}Q^{2}}\right)\frac{\sin(Qr_{jl})}{r_{jl}}. \tag{7}$$

$$\hat{F}(Q)=\frac{F(Q)}{f_{j}^{\star}f_{l}}\quad\text{and}\quad\hat{G}(r)=\frac{2}{\pi}\int_{Q_{\min}}^{Q_{\max}}\hat{F}(Q)\sin(Qr)\,dQ. \tag{8}$$

$$\hat{F}(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}\left(e^{-\frac{1}{2}\sigma_{jl}^{2}Q^{2}}\right)\frac{\sin(Qr_{jl})}{r_{jl}} \tag{9}$$

$$\hat{F}(Q)=\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i}). \tag{10}$$

$$\hat{G}(r)=\frac{2}{\pi}\left\{\int_{0}^{\infty}-\int_{0}^{Q_{\min}}-\int_{Q_{\max}}^{\infty}\right\}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i})\sin(Qr)\,dQ=\hat{G}_{1}(r)+\hat{G}_{2}(r)+\hat{G}_{3}(r)$$

$$\hat{G}_{3}(r)=\ldots\leq\ldots$$

$$\hat{G}_{1}(r)=\frac{1}{\sqrt{2\pi}}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}\sigma_{i}}\left(e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}-e^{-\frac{(r+r_{i})^{2}}{2\sigma_{i}^{2}}}\right), \tag{17}$$

$$\hat{G}_{2}(r)=-\frac{2}{\pi}\int_{0}^{Q_{\min}}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i})\sin(Qr)\,dQ=-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left(\frac{\sin((r-r_{i})Q_{\min})}{r-r_{i}}-\frac{\sin((r+r_{i})Q_{\min})}{r+r_{i}}\right)+\sum_{i=1}^{k}\frac{m_{i}\sigma_{i}^{2}}{r_{i}}\,O(Q_{\min}^{3})$$

$$-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left(\frac{\sin((r-r_{i})Q_{\min})}{r-r_{i}}-\frac{\sin((r+r_{i})Q_{\min})}{r+r_{i}}\right)$$

$$\hat{G}_{1}^{\prime\prime}(r)=\frac{1}{\sqrt{2\pi}}\sum_{i=1}^{k}\frac{-m_{i}}{r_{i}\sigma_{i}^{3}}\left[\left(1-\frac{(r-r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}-\left(1-\frac{(r+r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r+r_{i})^{2}}{2\sigma_{i}^{2}}}\right].$$

$$\hat{G}_{1}^{\prime\prime}(r)=\frac{1}{\sqrt{2\pi}}\,\frac{-m_{i}}{r_{i}\sigma_{i}^{3}}\left(1-\frac{(r-r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}.$$

$$\sigma_{i}=\frac{z_{2}^{\star}-z_{1}^{\star}}{2}. \tag{24}$$

$$\hat{G}_{2}^{\prime\prime}(r)=-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left[\left(\frac{-Q_{\min}^{2}\sin((r-r_{i})Q_{\min})}{r-r_{i}}+\frac{-2Q_{\min}\cos((r-r_{i})Q_{\min})}{(r-r_{i})^{2}}+\frac{2\sin((r-r_{i})Q_{\min})}{(r-r_{i})^{3}}\right)-\left(\frac{-Q_{\min}^{2}\sin((r+r_{i})Q_{\min})}{r+r_{i}}+\frac{-2Q_{\min}\cos((r+r_{i})Q_{\min})}{(r+r_{i})^{2}}+\frac{2\sin((r+r_{i})Q_{\min})}{(r+r_{i})^{3}}\right)\right]+O(Q_{\min}^{5})$$

$$\frac{-Q_{\min}^{2}\sin((r-r_{i})Q_{\min})}{r-r_{i}}+\frac{-2Q_{\min}\cos((r-r_{i})Q_{\min})}{(r-r_{i})^{2}}+\frac{2\sin((r-r_{i})Q_{\min})}{(r-r_{i})^{3}}=\ldots$$

Taxonomy

Topics: Spectroscopy and Chemometric Analyses · Molecular Sensors and Ion Detection

Full text

Affiliations: (a) Department of Applied Physics and Applied Mathematics, Fu Foundation School of Engineering & Applied Sciences, Columbia University, USA; (b) Condensed Matter Physics and Materials Science Department, Brookhaven National Laboratory, Upton, NY 11973, USA

Keywords: pair distribution function, distance list, peak extraction, Debye scattering equation, curve fitting

1 Introduction

Determining the three-dimensional atomic positions in a nanostructure is one of the great challenges in materials science and engineering [billinge2007problem]. One experimentally accessible encoding of the local structure is the atomic pair distribution function (PDF), which is fundamentally a list of inter-atomic distances in the material [egami;b;utbp12, warren2012x]. This fact has led to a mathematical description of the nanostructure inverse problem as the unassigned distance geometry problem (uDGP) [billi;4or18, duxbu;dam16, duxbu;4or16, juhas;jac10, juhas;n06]. The PDF can be obtained by taking a Fourier transform of the structure function, which is extracted from the measured total scattering of x-rays, neutrons or electrons from a sample. The PDF method is widely used to study nanostructures [egami;b;utbp12, billinge2004beyond, billinge2007problem, young;jmc11, proff;jac97, page;jac11i, cliff;prl10].

The experimental peak width is determined by both physical properties and experimental resolution [egami;b;utbp12]. In high symmetry structures such as bulk Ni, which crystallizes in a face-centered cubic lattice, distances of the same length occur frequently and the degeneracy of each distance can be estimated from the integrated area of each peak [egami;b;utbp12]. However, a major challenge in determining the list of inter-atomic distances from a measured PDF comes from the fact that different interatomic vectors with similar lengths cannot be resolved due to peak overlap. This is not a problem if we have a good structural model which can be fit to the data, which is the basis of PDF fitting programs such as PDFgui [billi;b;lsfd98, proff;jac99, farro;jpcm07], an approach that is the real-space equivalent of Rietveld refinement of powder diffraction data [rietv;jac69]. However, it presents a significant problem for programs that extract peak positions and intensities in the absence of a structural model, which would be the real-space equivalent of LeBail [lebail;mrb87] and Pawley [pawle;jac81] refinement in the powder diffraction world.

A program for extracting distance lists from measured PDFs has been reported. ParSCAPE is an algorithm which can extract this complete set of information from the PDF by using the information-theoretic Akaike information criterion (AIC) [granlund2015algorithm], and it is available as the program SrMise on Diffpy.org. However, the PDF baseline must be specified before peak extraction, and the results are conditioned upon it. Correct estimation of PDF baselines, especially from nanoparticle PDFs, remains challenging and requires human intervention, which is a drawback preventing full automation of SrMise.

Developing an algorithm for peak extraction which is automatable and robust to details of the baseline is our main goal here. From the curve-fitting point of view, an estimated distance list can be regarded as a set of variables that generates a simulated PDF through the given mathematical model. One may then minimize the residual between the simulated PDF and a target PDF to obtain the optimized distance list. However, the resulting curve fitting is generically a non-convex programming problem. To solve the problem more effectively, we analyze the properties of the mathematical model, which allows us to construct the initial guess of the variables automatically. Good fitting results are demonstrated in preliminary tests using simulated and experimental PDF data.

This paper is organized as follows: In Section 2, we briefly introduce the PDF method and present the mathematical model that we use to approximate the experimental PDF. Section 3 is a theory section containing the analysis of properties of the mathematical model. This leads to an approach to guess the initial values of all variables. Section 4 describes the formulation of the PDF distance list optimization. In Section 5, we present results from simulated and experimental PDF datasets used for testing the algorithm, and Section 6 contains a summary of the main points of the paper.

2 Mathematical Model of PDF

We consider a nanostructure with a set of atoms. Let $N$ be the total number of atoms in the structure, and $\{r_{j}\}_{j=1}^{N}$ denote the positions of the atoms. The ideal PDF is defined by [farro;aca09, egami;b;utbp12]

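$$g(r)=\frac{1}{r}\,\frac{1}{N\langle f\rangle^{2}}\sum_{j\neq l}f_{j}^{\star}f_{l}\,\delta(r-r_{jl}). \tag{1}$$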

Here, for $j=1,\ldots,N$ and $l=1,\ldots,N$, $r_{jl}$ is the distance between atoms $j$ and $l$ located at positions $r_{j}$ and $r_{l}$, so that $r_{jl}=||r_{l}-r_{j}||$, where $||\cdot||$ is the Euclidean norm, $f_{j}$ is the scattering power of the atom at position $r_{j}$, and $f^{\star}_{j}$ is its complex conjugate.

The ideal PDF (1) may also be obtained from measured data according to

$$G(r)=\frac{2}{\pi}\int_{0}^{\infty}F(Q)\sin(Qr)\,dQ, \tag{2}$$

where $F(Q)=Q[S(Q)-1]$ is the normalized and corrected powder diffraction intensity, which is expressed in the Debye Scattering Equation as

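$$F(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}f_{j}^{\star}f_{l}\,\frac{\sin(Qr_{jl})}{r_{jl}}. \tag{3}$$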

Here the quantity $S(Q)$ is called the structure function and $F(Q)$ the reduced structure function [warren2012x] and $Q$ is the magnitude of the scattering vector.

Due to physical constraints in the experiment, the variable $Q$ takes values only in the interval $[Q_{\min},Q_{\max}]$. Thus, unlike the standard Fourier transform, the PDF is obtained by integrating over this finite interval [farro;aca09],

$$G(r)=\frac{2}{\pi}\int_{Q_{\min}}^{Q_{\max}}F(Q)\sin(Qr)\,dQ. \tag{4}$$

To compute $G(r)$ numerically from $F(Q)$ on the discrete $Q_{i}$ grid, we approximate $G(r)$ by using the finite sum

$$G(r)\approx\frac{2}{\pi}\sum_{i=1}^{N_{Q}}F(Q_{i})\sin(Q_{i}r)\,\Delta Q_{i}, \tag{5}$$

where $N_{Q}$ is the number of discrete values of $Q$ , and $\Delta Q_{i}$ is the difference between two adjacent values of $Q$ .
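The finite-sum transform described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sine_transform`, the uniform $Q$ grid, and the single damped-sinusoid test signal are hypothetical choices for the example and are not taken from the paper.

```python
import math

def sine_transform(q_grid, f_values, r):
    """Finite-sum approximation G(r) ~ (2/pi) * sum_i F(Q_i) sin(Q_i r) dQ_i."""
    total = 0.0
    for i in range(len(q_grid) - 1):
        dq = q_grid[i + 1] - q_grid[i]          # spacing between adjacent Q values
        total += f_values[i] * math.sin(q_grid[i] * r) * dq
    return 2.0 / math.pi * total

# Hypothetical test signal: one damped sinusoid, F(Q) = exp(-sigma^2 Q^2 / 2) sin(Q r0)
r0, sigma = 2.9, 0.1
q_grid = [0.01 * i for i in range(5001)]        # Q from 0 to 50 inverse Angstrom
f_values = [math.exp(-0.5 * (sigma * q) ** 2) * math.sin(q * r0) for q in q_grid]

g_at_peak = sine_transform(q_grid, f_values, r0)         # large: r coincides with r0
g_off_peak = sine_transform(q_grid, f_values, r0 + 0.5)  # close to zero: far from r0
```

The sum uses the left end of each interval, which is adequate here because the sampled signal is smooth and the grid is fine; a non-uniform grid is handled by the per-interval `dq`.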

By ignoring the finite $Q_{\max}$ , $G(r)$ and the radial distribution function $R(r)$ are related by

$$G(r)=\frac{R(r)}{r}-4\pi\rho_{0}\gamma_{0}(r)\,r \tag{6}$$

where $\rho_{0}$ is the average density and $\gamma_{0}$ is the characteristic function of the sample shape [fournet1955small, farro;aca09]. The term

$$4\pi\rho_{0}\gamma_{0}(r)\,r=\frac{2}{\pi}\int_{0}^{Q_{\min}}F(Q)\sin(Qr)\,dQ$$

is a baseline, and we may think of $G(r)$ as the baseline plus peaks. In the literature to date, the shape of the baseline is either determined directly from the shape of the structural model [juhas;aca15, farro;jpcm07], or approximated using expansions of ad hoc mathematical functions [korsunskiy2005exact, neder2005structure, korsunskiy2007aspects, neder2007structural]. In the case of bulk crystals, $\gamma_{0}(r)\approx 1$ and the baseline is linear [egami1998local, proffen1999pdffit, farro;jpcm07]. However, in general, without a good structural model, the baseline is not known a priori.

The experimental signal is a time and ensemble average of large numbers of atoms and the Dirac delta-function peaks given in the ideal PDF (1) broaden into nearly Gaussian peaks. In reciprocal space, to account for atomic motion, (3) is replaced by a version that includes Debye-Waller effects,

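$$F(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}f_{j}^{\star}f_{l}\left(e^{-\frac{1}{2}\sigma_{jl}^{2}Q^{2}}\right)\frac{\sin(Qr_{jl})}{r_{jl}}. \tag{7}$$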

Here, $\sigma_{jl}$ is the correlated broadening factor for the atom pair [proffen1999pdffit, thorpe2002semiconductors, jeong2003lattice]. This mathematical model has been successfully used to study nanostructures by a number of authors [zhang2003water, cervellino2006efficient].

For the case of samples made of a single atom type, the atomic form factors $f_{j}$ can be factored out resulting in new functions

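$$\hat{F}(Q)=\frac{F(Q)}{f_{j}^{\star}f_{l}}\quad\text{and}\quad\hat{G}(r)=\frac{2}{\pi}\int_{Q_{\min}}^{Q_{\max}}\hat{F}(Q)\sin(Qr)\,dQ. \tag{8}$$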

We regard

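$$\hat{F}(Q)=\frac{1}{N\langle f\rangle^{2}}\sum_{l\neq j}\left(e^{-\frac{1}{2}\sigma_{jl}^{2}Q^{2}}\right)\frac{\sin(Qr_{jl})}{r_{jl}} \tag{9}$$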

as our mathematical model. If the material contains different atomic types, we use Equation (7) instead.

In Equation (9), we want to merge distances of the same length together. Notice that distances of the same length may have different $\sigma$ . Nevertheless, we still put them together because peaks at the same position are more difficult to differentiate. We then obtain the following mathematical model,

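$$\hat{F}(Q)=\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i}). \tag{10}$$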

Here, $k$ is the number of different values of unresolved distances. $m_{i}$ represents the relative multiplicity which is equal to multiplicity times $1/(N\langle f\rangle^{2})$ .

During curve fitting, we first determine the value of $k$ . Then we recognize all the $r_{i}$ , $m_{i}$ , $\sigma_{i}$ as variables. In the next section, we discuss how to construct the initial guesses for these variables.
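To make the role of the fitting variables concrete, the merged-distance model can be evaluated directly from a list of $(r_{i},\sigma_{i},m_{i})$ triples. The sketch below is a minimal illustration: the name `f_hat`, the three-peak list, and the grid spacing are assumptions made for the example, not values or code from the paper.

```python
import math

def f_hat(q, peaks):
    """Merged-distance model: sum_i (m_i / r_i) exp(-sigma_i^2 q^2 / 2) sin(q r_i).

    peaks is a list of (r_i, sigma_i, m_i) tuples: distance, width, relative multiplicity.
    """
    return sum(
        m / r * math.exp(-0.5 * (sigma * q) ** 2) * math.sin(q * r)
        for r, sigma, m in peaks
    )

# Illustrative three-peak distance list (values chosen for the example only)
peaks = [(2.90, 0.10, 57.0), (4.10, 0.10, 15.0), (4.76, 0.10, 15.0)]

# Sample the model on a uniform grid from Q_min = 0.5 to Q_max = 30 inverse Angstrom
q_grid = [0.5 + 0.05 * i for i in range(591)]
f_values = [f_hat(q, peaks) for q in q_grid]
```

Each peak contributes three unknowns, so a list with $k$ entries gives $3k$ fitting variables once $k$ has been fixed.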

3 Mathematical Model Analysis and Initial Guess

This section is divided into several parts. We first present some properties of the mathematical model used to calculate PDFs, and then describe a few approaches for determining an initial guess distance list, using different atomic structures as examples.

3.1 Properties

In real experiments, the intensities are measured only over a range $Q_{min}<Q<Q_{max}$ , which introduces aberrations to the data that must be handled by our automated algorithm. To explore this in more detail we first consider a low energy 18-atom Lennard-Jones decahedral cluster [wales2001cambridge]. Figure 1(a) shows the function $\hat{F}(Q)$ , calculated from Equation (10) using the decahedral structure model, over the $Q$ -range from $Q_{\min}=0~{}\text{\AA}^{-1}$ to $Q_{max}=30~{}\text{\AA}^{-1}$ , with $\sigma_{i}$ set to $0.1$ Å.

The three curves in the bottom panel of Figure 1 show the function $\hat{G}(r)$ , calculated from Equation (5) where $\sigma_{i}$ is kept fixed at $0.1$ Å, and the $Q$ -ranges are varied in order to illustrate the effects on the transformed PDFs. There are a number of peaks in $\hat{G}(r)$ where each peak represents one or more distances. We take as reference a PDF calculated with a small but finite $Q_{\mathrm{min}}$ , and a large $Q_{\mathrm{max}}$ ( $Q_{\min}=0.5~{}\text{\AA}^{-1}$ , $Q_{\max}=30~{}\text{\AA}^{-1}$ ) which is shown as the dark blue curve, and compare it with the red curve ( $Q_{\min}=0~{}\text{\AA}^{-1}$ , $Q_{\max}=30~{}\text{\AA}^{-1}$ ), where we find that a larger $Q_{\mathrm{min}}$ makes the $\hat{G}(r)$ baseline deeper, as expected [farro;aca09, egami;b;utbp12]. Comparing the same dark blue curve with the green curve ( $Q_{\min}=0.5~{}\text{\AA}^{-1}$ , $Q_{\max}=20~{}\text{\AA}^{-1}$ ), we find that smaller $Q_{\mathrm{max}}$ leads to larger oscillations in $\hat{G}(r)$ , again as expected [egami;b;utbp12]. The coordinates of the atoms in the structure models were determined algorithmically using the Atomic Simulation Environment (ASE) Python package [Larsenatomicsimulationenvironment2017], as described in [BanerjeeImprovedModelsMetallic2018a].

To consider the effects of the $Q$ -range we decompose the PDF into different contributions from the different ranges of $Q$ . According to Equations (8) and (10), we have

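$$\hat{G}(r)=\frac{2}{\pi}\left\{\int_{0}^{\infty}-\int_{0}^{Q_{\min}}-\int_{Q_{\max}}^{\infty}\right\}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i})\sin(Qr)\,dQ=\hat{G}_{1}(r)+\hat{G}_{2}(r)+\hat{G}_{3}(r)$$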

Here we split $\hat{G}(r)$ into three parts. If $Q_{max}$ is large enough, we regard the integral from ${Q_{max}}$ to $\infty$ as 0 due to the exponential term. The termination of the Fourier transform varies with the type of material and with the amplitude of lattice vibrations but, in general, termination with $Q>30~{}\text{\AA}^{-1}$ produces minimal errors [toby1992accuracy]. However, for lower ${Q_{max}}$ the ripples may be significant (e.g., green PDF in the bottom panel of Figure 1), in which case, in the absence of a structural model, the oscillations may be mistaken for physical peaks. This increases the computational effort for extraction. We can use the following inequality as a threshold to reduce the number of misidentified peaks,

[TABLE]

where $r_{1}$ is the smallest distance and $\underline{\sigma}$ is a lower bound of all $\sigma_{i}$ .

The unattenuated term contains the atomic-scale structural information and is given by

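$$\hat{G}_{1}(r)=\frac{1}{\sqrt{2\pi}}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}\sigma_{i}}\left(e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}-e^{-\frac{(r+r_{i})^{2}}{2\sigma_{i}^{2}}}\right), \tag{17}$$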

which is a sum of Gaussians as expected.
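For readers who want to see why the full-range integral yields Gaussians, the step can be worked out for a single term of the model. The derivation below is a sketch that uses only a product-to-sum identity and the standard Gaussian cosine integral; it is added here for illustration.

```latex
\sin(Qr_i)\sin(Qr) = \tfrac{1}{2}\Big[\cos\big(Q(r-r_i)\big) - \cos\big(Q(r+r_i)\big)\Big],
\qquad
\int_0^\infty e^{-\frac{1}{2}\sigma_i^2 Q^2}\cos(bQ)\,dQ
  = \sqrt{\frac{\pi}{2}}\;\frac{1}{\sigma_i}\;e^{-\frac{b^2}{2\sigma_i^2}},

\frac{2}{\pi}\,\frac{m_i}{r_i}\int_0^\infty e^{-\frac{1}{2}\sigma_i^2 Q^2}\sin(Qr_i)\sin(Qr)\,dQ
  = \frac{1}{\sqrt{2\pi}}\,\frac{m_i}{r_i\,\sigma_i}
    \left(e^{-\frac{(r-r_i)^2}{2\sigma_i^2}} - e^{-\frac{(r+r_i)^2}{2\sigma_i^2}}\right).
```

The second exponential is centred at $-r_{i}$ and is negligible for $r>0$ whenever $r_{i}\gg\sigma_{i}$, so each distance appears as one Gaussian of width $\sigma_{i}$ centred at $r_{i}$.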

The part that determines the baseline is given by [farro;aca09]

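$$\hat{G}_{2}(r)=-\frac{2}{\pi}\int_{0}^{Q_{\min}}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\,e^{-\frac{1}{2}\sigma_{i}^{2}Q^{2}}\sin(Qr_{i})\sin(Qr)\,dQ=-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left(\frac{\sin((r-r_{i})Q_{\min})}{r-r_{i}}-\frac{\sin((r+r_{i})Q_{\min})}{r+r_{i}}\right)+\sum_{i=1}^{k}\frac{m_{i}\sigma_{i}^{2}}{r_{i}}\,O(Q_{\min}^{3})$$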

which has been simplified here by taking terms only up to second order in a Taylor series expansion. Approximating the baseline by

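$$-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left(\frac{\sin((r-r_{i})Q_{\min})}{r-r_{i}}-\frac{\sin((r+r_{i})Q_{\min})}{r+r_{i}}\right)$$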

does provide a good approximation as shown in Figure 2.

3.2 Initial Guess on a single peak

If the target function is a sum of Gaussians, one can make an initial guess using the second order derivative [goshtasby1994curve]. Approaches using higher order derivatives and wavelet transforms have been applied to a variety of experimentally measured spectra to find local maxima, often combined with Gaussian denoising filters [HuangPrecisionPeakDetermination1988, Gregoirewavelettransformalgorithm2011, SavitzkySmoothingDifferentiationData1964]. However, our mathematical model is not exactly a sum of Gaussians. The Gaussians are contained in $\hat{G}_{1}(r)$, but in general our signal also includes a baseline term, $\hat{G}_{2}(r)$, and termination effects, $\hat{G}_{3}(r)$. We now consider the effects from these different $Q$-dependent contributions on our ability to accurately extract peak parameters using second and higher order derivatives.

We begin by considering $\hat{G}_{1}(r)$ , which contains the structural signal. The second derivative of $\hat{G}_{1}(r)$ , following Equation (17), is

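$$\hat{G}_{1}^{\prime\prime}(r)=\frac{1}{\sqrt{2\pi}}\sum_{i=1}^{k}\frac{-m_{i}}{r_{i}\sigma_{i}^{3}}\left[\left(1-\frac{(r-r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}-\left(1-\frac{(r+r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r+r_{i})^{2}}{2\sigma_{i}^{2}}}\right].$$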

First, we take the simplest case of an isolated single peak that we label as the $i$-th peak. Because the exponential terms decay rapidly, other peaks some distance away affect this peak only slightly. Then, $r_{i}$ can be extracted from the location of the local maximum of $\hat{G}(r)$ or $-\hat{G}^{\prime\prime}(r)$. Consider $r$ around $r_{i}$,

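$$\hat{G}_{1}^{\prime\prime}(r)=\frac{1}{\sqrt{2\pi}}\,\frac{-m_{i}}{r_{i}\sigma_{i}^{3}}\left(1-\frac{(r-r_{i})^{2}}{\sigma_{i}^{2}}\right)e^{-\frac{(r-r_{i})^{2}}{2\sigma_{i}^{2}}}.$$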

We have two zero crossing points of $\hat{G}_{1}^{\prime\prime}(r)$ , $z_{1}^{\star}=r_{i}-\sigma_{i}$ and $z_{2}^{\star}=r_{i}+\sigma_{i}$ . Then we take as the initial guess of $\sigma_{i}$ ,

$$\sigma_{i}=\frac{z_{2}^{\star}-z_{1}^{\star}}{2}. \tag{24}$$

This result will be accurate if the curved baseline contribution to $\hat{G}$ , $\hat{G}_{2}(r)$ , does not introduce a significant shift on the zero crossing points.

Consider the second order derivative of $\hat{G}_{2}(r)$ ,

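$$\hat{G}_{2}^{\prime\prime}(r)=-\frac{1}{\pi}\sum_{i=1}^{k}\frac{m_{i}}{r_{i}}\left[\left(\frac{-Q_{\min}^{2}\sin((r-r_{i})Q_{\min})}{r-r_{i}}+\frac{-2Q_{\min}\cos((r-r_{i})Q_{\min})}{(r-r_{i})^{2}}+\frac{2\sin((r-r_{i})Q_{\min})}{(r-r_{i})^{3}}\right)-\left(\frac{-Q_{\min}^{2}\sin((r+r_{i})Q_{\min})}{r+r_{i}}+\frac{-2Q_{\min}\cos((r+r_{i})Q_{\min})}{(r+r_{i})^{2}}+\frac{2\sin((r+r_{i})Q_{\min})}{(r+r_{i})^{3}}\right)\right]+O(Q_{\min}^{5})$$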

where we have again used the Taylor expansion. Define $\eta=(r-r_{i})Q_{\min}$ . If $\eta$ is close to zero, by Taylor expansion,

[TABLE]

On the other hand, if $\eta$ is away from zero, then

[TABLE]

Therefore,

[TABLE]

This is small and we can, with confidence, set $z^{\star}$ and $z$ to be the zero crossings on the same side of $\hat{G}_{1}^{\prime\prime}(r)$ and $\hat{G}_{1}^{\prime\prime}(r)+\hat{G}_{2}^{\prime\prime}(r)$ , respectively. Further, we can ignore $G_{3}^{\prime\prime}$ because when $Q_{max}$ is large enough, we can approximate it as zero due to its exponential term. This means that although $G(r)$ is not purely a sum of Gaussians, the multiple derivative zero crossings method can still be expected to give acceptably good initial estimates of single peak positions. Then, we have

[TABLE]

and by the Mean Value Theorem,

[TABLE]

where $\hat{z}$ is a real number between $z$ and $z^{\star}$ . For $r$ around $r_{i}$ ,

[TABLE]

When $r\in(r_{i}-\sqrt{2}\sigma_{i},r_{i}-0.2\sigma_{i})\cup(r_{i}+0.2\sigma_{i},r_{i}+\sqrt{2}\sigma_{i})$ ,

[TABLE]

Therefore, we have

[TABLE]

To compute different even order derivatives of $\hat{G}(r)$ numerically, we do not use finite difference approximations, because they easily produce numerical instabilities when calculating higher order derivatives. Instead, we calculate the derivatives directly from the formula

[TABLE]

We only consider even order derivatives of $\hat{G}(r)$, which give the peaked functions that we seek for the zero crossing analysis. Intuitively, $r^{\star}$ is the maximizer/minimizer of $\hat{G}^{(2s)}(r)$ and then $\hat{G}^{(2s+1)}(r^{\star})=0$. Therefore,

[TABLE]

When the measured $\hat{F}(Q)$ is sampled only at a finite set of points $Q_{i}$, we approximate

[TABLE]

where $N_{Q}$ is the number of discrete values of $Q$, and $\Delta Q_{i}$ is the difference between two adjacent values of $Q$. The high order derivatives obtained from $\hat{F}(Q)$ by this method are more stable. When $\hat{F}(Q)$ is not available from experimental data, a finite difference approximation can be used to calculate higher order derivatives from experimentally measured $G(r)$.
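One way to realise the idea of differentiating the transform rather than the sampled curve is to differentiate the sine kernel term by term, using the identity $\frac{d^{2s}}{dr^{2s}}\sin(Qr)=(-1)^{s}Q^{2s}\sin(Qr)$. The sketch below assumes that form; the function name `g_hat_derivative` and the single-peak test signal are hypothetical and serve only as an illustration.

```python
import math

def g_hat_derivative(q_grid, f_values, r, order):
    """Even-order r-derivative of the finite-sum sine transform, taken term by term."""
    if order % 2:
        raise ValueError("only even derivative orders are handled here")
    sign = (-1) ** (order // 2)      # d^(2s)/dr^(2s) sin(Qr) = (-1)^s Q^(2s) sin(Qr)
    total = 0.0
    for i in range(len(q_grid) - 1):
        q = q_grid[i]
        dq = q_grid[i + 1] - q       # spacing between adjacent Q values
        total += sign * q ** order * f_values[i] * math.sin(q * r) * dq
    return 2.0 / math.pi * total

# Hypothetical single-peak signal with r0 = 2.9 and sigma = 0.1 (Angstrom)
r0, sigma = 2.9, 0.1
q_grid = [0.01 * i for i in range(5001)]
f_values = [math.exp(-0.5 * (sigma * q) ** 2) * math.sin(q * r0) for q in q_grid]

second_at_peak = g_hat_derivative(q_grid, f_values, r0, order=2)          # strongly negative
second_at_edge = g_hat_derivative(q_grid, f_values, r0 + sigma, order=2)  # near a zero crossing
```

The factor $Q^{2s}$ amplifies the high-$Q$ part of the signal, so in practice the usable derivative order is limited by the noise level at large $Q$.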

In Figure 3, the blue curve, $\hat{G}(r)$ , is the simulated PDF of an 18-atom decahedron introduced in Section 3.1, with $Q_{\min}=0.5~{}\text{\AA}^{-1}$ , $Q_{\max}=30~{}\text{\AA}^{-1}$ and $\sigma_{i}=0.1~{}\text{\AA}$ for all $i=1,\ldots,k$ .

The red curve, $-\hat{G}^{\prime\prime}(r)$ , is calculated as given in Equation (41). A magnified view is shown in the inset from $\sim$ 2.5-5.3 Å. The distances between the zero crossing of the first two peaks are very close to $0.2~{}\text{\AA}$ which is twice the value of $\sigma$ used to generate this PDF. This shows that it is reasonable to guess the initial value of $\sigma_{i}$ using the zero-crossing estimation given in Equation (24).
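The zero-crossing estimate of $\sigma_{i}$ can be checked on an idealised case. The sketch below assumes a single isolated Gaussian peak with no baseline or termination ripples, evaluates its negative second derivative in closed form, and locates the two sign changes by linear interpolation. The helper names and the grid are illustrative choices, not part of the published algorithm.

```python
import math

def neg_second_derivative(r, r_i, sigma_i):
    """-d^2/dr^2 of a unit-height Gaussian centred at r_i."""
    u = (r - r_i) / sigma_i
    return (1.0 - u * u) * math.exp(-0.5 * u * u) / sigma_i ** 2

def zero_crossings(r_grid, values):
    """Locate sign changes by linear interpolation between neighbouring grid points."""
    crossings = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a * b < 0.0:
            t = a / (a - b)          # fraction of the interval where the line hits zero
            crossings.append(r_grid[i] + t * (r_grid[i + 1] - r_grid[i]))
    return crossings

r_i, sigma_i = 2.9, 0.1                                  # example peak (Angstrom)
r_grid = [2.5005 + 0.001 * i for i in range(800)]        # grid offset avoids exact zeros
values = [neg_second_derivative(r, r_i, sigma_i) for r in r_grid]

z1, z2 = zero_crossings(r_grid, values)
sigma_guess = 0.5 * (z2 - z1)        # half the distance between the zero crossings
r_guess = 0.5 * (z1 + z2)            # midpoint of the crossings, at the peak position
```

For this noiseless peak the two crossings sit at $r_{i}\pm\sigma_{i}$, so their separation is $2\sigma_{i}$, matching the roughly 0.2 Å spacing noted for the first two peaks.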

3.3 Initial guess for the case of overlapped peaks

Figure 3 shows another interesting phenomenon. The overlapped peaks near $4.8$ Å appear as two well separated peaks in the second derivative curve shown in red. This motivates us to use higher derivatives to search for overlapped peaks.

We consider the $n$ -th derivative, where $n$ is an even number, $n=2s$ . Using the fact that

[TABLE]

where $C_{n}^{k}$ is the number of combinations of $n$ items taken $k$ at a time, defined as $n!/(k!(n-k)!)$, $!$ is the factorial, and $!!$ is the double factorial. Considering $r$ around $r_{i}$ and following Equation (17), we again take a Taylor expansion giving

[TABLE]

The two nearest zero crossing points of $\hat{G}_{1}^{(n)}(r)$ are $z_{1}^{\star}\approx r_{i}-\frac{1}{\sqrt{s}}\sigma_{i}$ and $z_{2}^{\star}\approx r_{i}+\frac{1}{\sqrt{s}}\sigma_{i}$ . Similar to Equation (32), we obtain that $\hat{G}_{2}^{(n)}(r)$ does not affect this guess too much due to

[TABLE]

we can again use

[TABLE]

as the initial guess of $\sigma_{i}$ .

The PDF simulated from a different model, a 39-atom decahedral cluster, provides an illustration of another interesting point of using zero-crossings from higher order derivatives to locate peaks. Figure 4(a) shows the PDF calculated from this model. We set all $\sigma_{i}$ to be $0.1~{}\text{\AA}$ , $Q_{\min}=0.5~{}\text{\AA}^{-1}$ , $Q_{\max}=30~{}\text{\AA}^{-1}$ .

The peak in the interval from 8.1 Å to 8.5 Å looks single valued but contains two true peaks at 8.23 Å and 8.40 Å, respectively. In Figure 4(b) we magnify this narrow $r$ -range and overlay the 2nd order (red) and 4th order (green) derivatives on top of the simulated PDF (light blue). This shows that the 2nd order derivative with only two zero-crossings cannot sufficiently resolve the split peak, whereas the 4th order derivative can. In Figure 4(b), the initial guesses for the position of the two peaks are highlighted with teal arrows, which are determined from the four zero-crossings (purple markers).
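The behaviour described for this doublet can be reproduced with two Gaussians at 8.23 Å and 8.40 Å and $\sigma=0.1$ Å. The sketch below assumes equal weights for the two peaks, which is an assumption made for the example because the true multiplicities are not given here. It uses the closed-form second and fourth derivatives of a Gaussian, written with the Hermite polynomials $u^{2}-1$ and $u^{4}-6u^{2}+3$, and counts sign changes between the two peak centres.

```python
import math

def gaussian_derivative(r, center, sigma, order):
    """2nd or 4th derivative of exp(-(r - center)^2 / (2 sigma^2))."""
    u = (r - center) / sigma
    poly = {2: u * u - 1.0, 4: u ** 4 - 6.0 * u * u + 3.0}[order]
    return poly * math.exp(-0.5 * u * u) / sigma ** order

def sign_changes(values):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0.0)

centers, sigma = (8.23, 8.40), 0.1       # doublet from the text; equal weights assumed
r_grid = [centers[0] + 0.0005 * i for i in range(341)]   # from 8.23 to 8.40 Angstrom

d2 = [sum(gaussian_derivative(r, c, sigma, 2) for c in centers) for r in r_grid]
d4 = [sum(gaussian_derivative(r, c, sigma, 4) for c in centers) for r in r_grid]

crossings_2nd = sign_changes(d2)   # second derivative keeps one sign between the centres
crossings_4th = sign_changes(d4)   # fourth derivative changes sign on both sides of the midpoint
```

With these assumptions the second derivative shows no zero crossing between the two centres, so the doublet looks like one lobe, while the fourth derivative crosses zero twice and separates it into two.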

In this case, $\hat{G}^{(4)}(r)$ was optimal for separating these two peaks. In practice, higher derivatives give greater selectivity for finding overlapped peaks, but also dramatically increase the number of zero crossings originating from noise, and a balance must be struck between these two competing factors. As a rule of thumb, we have found that the $(n+2)$-th derivative should be considered only when the $n$-th derivative does not result in reasonable initial guesses for $r_{i}$ and $\sigma_{i}$. In practice, for a fully automated peak extraction program we do not want human involvement in the decision making. We have found that $\hat{G}^{(4)}(r)$ is a good balance between sensitivity and noise suppression in the examples we have tried. In the future, we may experiment with different protocols, for example, adaptively trying derivatives of different order, and even changing the order used to extract signals from specific peaks in the PDF. These improvements to the automated heuristic have not proven to be necessary to date.

3.4 Initial guess for the peak amplitude, $m_{i}$

After giving the initial guesses for $r_{i}$ and $\sigma_{i}$, we can estimate values for the peak amplitudes, $m_{i}$, by solving the following standard box-constrained least-squares problem,

[TABLE]

where $n_{Q}$ is the number of discrete values of $Q$ and the $r_{i}$ and $\sigma_{i}$ values are held constant. This is a convex quadratic programming problem which can be easily solved without having to specify initial values for the $m_{i}$ . For example, here we use the primal-dual interior-point algorithm [nesterov1997self, nesterov1998primal].
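Because the model is linear in the $m_{i}$ once $r_{i}$ and $\sigma_{i}$ are fixed, this step is a bounded linear least-squares fit. The sketch below illustrates it with projected coordinate descent, a deliberately simple substitute for the primal-dual interior-point solver mentioned above. The function name, the bounds, and the synthetic three-peak target are assumptions made for the example.

```python
import math

def fit_multiplicities(q_grid, f_target, r_list, sigma_list,
                       lower=0.0, upper=1e6, sweeps=50):
    """Box-constrained linear least squares for m_i with r_i and sigma_i held fixed."""
    # Column i holds exp(-sigma_i^2 Q^2 / 2) sin(Q r_i) / r_i sampled on the Q grid.
    columns = [
        [math.exp(-0.5 * (s * q) ** 2) * math.sin(q * r) / r for q in q_grid]
        for r, s in zip(r_list, sigma_list)
    ]
    m = [0.0] * len(columns)
    residual = list(f_target)            # target minus model; the model starts at zero
    for _ in range(sweeps):
        for i, col in enumerate(columns):
            norm = sum(c * c for c in col)
            step = sum(c * e for c, e in zip(col, residual)) / norm
            new_m = min(max(m[i] + step, lower), upper)   # 1-D minimiser clipped to the box
            delta = new_m - m[i]
            residual = [e - delta * c for e, c in zip(residual, col)]
            m[i] = new_m
    return m

# Synthetic target built from known (illustrative) parameters
r_list, sigma_list, m_true = [2.90, 4.10, 4.76], [0.10, 0.10, 0.10], [57.0, 15.0, 15.0]
q_grid = [0.5 + 0.05 * i for i in range(591)]
f_target = [
    sum(m / r * math.exp(-0.5 * (s * q) ** 2) * math.sin(q * r)
        for r, s, m in zip(r_list, sigma_list, m_true))
    for q in q_grid
]

m_fit = fit_multiplicities(q_grid, f_target, r_list, sigma_list)
```

No starting values for $m_{i}$ are needed because the problem is convex; the sweep simply starts from zero. A production code would more likely call a dedicated bounded least-squares or quadratic programming solver.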

4 Optimization

With initial values for the variables we can continue to the optimization step. A standard box-constrained least-squares problem is used to fit either the $\hat{F}(Q)$ or the $\hat{G}(r)$ curve according to

[TABLE]

and

[TABLE]

We have found that, compared to fitting on $\hat{F}(Q)$, the real-space optimization is more computationally intensive but can yield better solutions. For the optimization we use a subspace trust-region method based on the interior-reflective Newton method described in [coleman1996interior]. Each iteration involves the approximate solution of a large linear system using the method of preconditioned conjugate gradients (PCG). Due to the high nonlinearity and non-convexity of this least-squares problem, the solutions calculated by the solver depend sensitively on the starting values. Nonetheless, the initial values we obtained from the derivative zero crossings have proven to be stable.

5 Testing the approach

Two target PDFs are tested for our extraction algorithm in this section. One shows the peaks extracted from a simulated PDF, and the other from an experimental PDF of atomically precise clusters where the structure has been satisfactorily solved.

5.1 Test on simulated data.

We revisit the 18-atom Lennard-Jones decahedron discussed in Sections 3.1-3.2 to generate a simulated PDF for this test. Here we set $Q_{\min}=0.5~{}\text{\AA}^{-1}$, $Q_{\max}=30~{}\text{\AA}^{-1}$, and all $\sigma_{i}=0.1~{}\text{\AA}$. The distances extracted using our approach are listed in Table 2, and the resulting PDF curves after the initial guess and the full refinement steps are shown in Figure 5. For this relatively high resolution ($Q_{\max}=30~{}\text{\AA}^{-1}$) case the extraction works rather well. The only peaks that the program could not extract separately were very close to each other, and in those cases the extraction returns a single peak carrying the full integrated intensity of both unresolved distances.

The first three columns in the table show the ground-truth parameters used to generate the PDF. There are 11 different distances, of which we find seven. The program could not resolve the peaks at 2.8921 Å and 2.9443 Å in the first feature. A single peak was returned at 2.9008 Å, very close to the weighted average position of the ground-truth peaks, 2.9013 Å. The program also returned a multiplicity of 56.9680, very close to the sum of the true multiplicities of the unresolved peaks, 57. The second and fourth peaks were also unresolved doublets. The widest unresolved splitting was 0.075 Å. The peak at 4.7640 Å and the unresolved doublet at 4.99 Å were successfully resolved; they are a little over 0.2 Å apart, very close to the expected resolution of data with $Q_{\max}=30~{}\text{\AA}^{-1}$ [farro;prb11].
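The weighted average quoted for the first feature can be checked with a few lines of arithmetic. In the sketch below the multiplicity of 10 for the 2.9443 Å distance is taken from Table 2, while the value of 47 for the 2.8921 Å distance is inferred here from the stated sum of 57 and is not quoted directly in the text.

```python
# Ground-truth doublet in the first feature (positions in Å).
r_true = [2.8921, 2.9443]
m_true = [47, 10]  # 10 from Table 2; 47 inferred from the quoted total of 57

m_sum = sum(m_true)
r_weighted = sum(m * r for m, r in zip(m_true, r_true)) / m_sum

# Multiplicity-weighted mean position and total multiplicity of the doublet.
print(round(r_weighted, 4), m_sum)
```

The result can be compared with the extracted peak at 2.9008 Å with multiplicity 56.9680. An unresolved doublet is expected to be returned as a single peak near the multiplicity-weighted mean position and carrying the summed weight.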

In general, we may be working with data measured at lower real-space resolution, for example with $Q_{\max}=23~{}\text{\AA}^{-1}$, which we also tested. A smaller $Q_{\max}$ results in lower resolution and also in termination ripples, or oscillations, that might confuse the extraction process. In the initial guess stage, our program finds 36 peaks, most of which in fact come from termination ripples. However, after the optimization stage, a subset of the peak amplitudes is close to 0. In Table 2, we filter these distances out programmatically and list only the peaks that return a multiplicity larger than 1, comparing them with the peaks extracted when $Q_{\max}=30~{}\text{\AA}^{-1}$. The quality of the extraction is worse for $Q_{\max}=23~{}\text{\AA}^{-1}$ than for $Q_{\max}=30~{}\text{\AA}^{-1}$, but it is still quite reasonable, containing a small number of false positives with very little weight. The program still successfully resolved the peaks at 4.76 Å and 4.99 Å, but this time it misassigned some weight between these two peaks. All in all, it was a satisfactory extraction.
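The programmatic filter described above amounts to a threshold on the refined multiplicities. A minimal sketch follows. The arrays are invented for illustration and are not the values in Table 2.

```python
import numpy as np

# Hypothetical optimizer output: position (Å) and multiplicity of each candidate peak.
r_fit = np.array([2.90, 3.35, 4.10, 4.76, 4.99, 5.60])
m_fit = np.array([57.4, 0.02, 23.8, 9.6, 41.1, 0.4])

threshold = 1.0            # keep only multiplicities larger than 1
keep = m_fit > threshold
r_kept, m_kept = r_fit[keep], m_fit[keep]

print(r_kept, m_kept)
```

Candidates generated by termination ripples tend to be driven toward zero amplitude during refinement, so a fixed cut of this kind removes them without manual inspection.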

An example with experimental data.

Next we test our distance extraction algorithm on experimental PDF data collected from 144-atom gold nanoclusters capped with 60 thiolate staples, Au$_{144}$(SC6)$_{60}$. Sample preparation is described in [QianAmbientSynthesisAu1442011] and data acquisition and processing to obtain the PDF are described in [jense;nc16]. A well-established DFT structure model exists for this sample [Lopez-AcevedoStructureBondingUbiquitous2009] (LA model) which was previously shown to be in good agreement with the measured PDFs [jense;nc16, BanerjeeImprovedModelsMetallic2018a]. The relaxed LA cluster structure is complex, with low symmetry chiral and staple arrangements of shell atoms on top of a higher symmetry Mackay icosahedral core. In Figure 6 we show the PDF resulting from the initial guess (a, c) and final fitting (b, d) to extract distances from the experimental $\hat{F}(Q)$ and $\hat{G}(r)$ curves. A $Q$-range from $Q_{\min}=0.8~{}\text{\AA}^{-1}$ to $Q_{\max}=25~{}\text{\AA}^{-1}$ was used for the PDF transformation, and a 4th order derivative was used to obtain initial guesses for $r_{i}$ and $\sigma_{i}$.

To improve the extraction, we provided an additional constraint, $\{\sigma_{i}\leq 0.16~{}\text{\AA}\}$, to bound all the $\sigma_{i}$ values. Values above this bound would yield isotropic atomic displacement parameters (ADPs) greater than $\sim 0.025~{}\text{\AA}^{2}$, which are unphysically large for homogeneous, atomically precise nanocluster samples. The fitted PDF and $F(Q)$ of the peak models converge nicely to the data, as is evident in Figure 6. The fit of the extracted PDF to the measured one indicates that we have obtained good convergence, but it is not a measure of the quality of the extraction, which is instead determined by how well the extracted distances agree with the actual ones. In Figure 6(e) and (f) we show histograms of the actual and the optimized-extracted distances, respectively, from the Au$_{144}$(SC6)$_{60}$ experimental data. A visual comparison suggests that the distribution of distances from the experimental data is very similar to the true distance histogram obtained from the LA structure solution; the overall shape and many of the fine features match well, although the extracted peaks are more coarse-grained because of unresolved overlapping peaks. A truncated list ($2.68<r<7.81$ Å) of the extracted peak parameters ($r_{i}$, $m_{i}$) is provided in Table 4, in addition to the optimized $\sigma_{i}$ values needed to calculate $\hat{F}(Q)$ and $\hat{G}(r)$. Due to the dense distribution of true distances, we assigned each extracted distance to a group of true distances, and summed the multiplicities per bin as the total multiplicity. To determine estimated multiplicities from the experimental Au$_{144}$(SC6)$_{60}$ data, we performed an additional extraction from data simulated from the LA structure model. The simulated PDF was generated with $Q_{\min}=0.8~{}\text{\AA}^{-1}$, $Q_{\max}=25~{}\text{\AA}^{-1}$ and all $\sigma_{i}=0.1~{}\text{\AA}$. This normalization results in good agreement between the experimental and simulated multiplicities from the LA model. The results from the simulated extraction are also provided in Table 4 next to the experimental extraction. This is a challenging low-symmetry nanostructure, but the automated extraction nonetheless seems to be working well.
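The grouping of true distances under each extracted peak can be sketched as follows. The text does not state the assignment rule, so this example assumes that each true distance is assigned to its nearest extracted peak. The function name `group_multiplicities` and all numbers are hypothetical.

```python
import numpy as np


def group_multiplicities(r_extracted, r_true, m_true):
    """Sum the true multiplicities assigned to each extracted peak.

    Assumption: a true distance belongs to the nearest extracted peak.
    """
    r_extracted = np.asarray(r_extracted, dtype=float)
    r_true = np.asarray(r_true, dtype=float)
    nearest = np.abs(r_true[:, None] - r_extracted).argmin(axis=1)
    return np.bincount(nearest, weights=m_true, minlength=r_extracted.size)


# Toy values for illustration only (distances in Å).
r_extracted = [2.90, 4.10, 5.00]
r_true = [2.85, 2.95, 4.05, 4.15, 5.00]
m_true = [3, 2, 4, 1, 6]

totals = group_multiplicities(r_extracted, r_true, m_true)
print(totals)
```

Each entry of `totals` is the summed true multiplicity of one bin, which can then be compared with the extracted multiplicity $m_{i}$ of the corresponding peak.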

6 Conclusion

In this paper, we have proposed an algorithm to extract distance lists from a target PDF with no a priori structural information. We use a mathematical model that exploits the sum-of-Gaussians nature of the Debye scattering equation and the PDF to automatically recover peak positions, and therefore interatomic distance information. The algorithm first uses an automated approach to find an initial guess for all the variables and then solves a global optimization problem. The preliminary tests show the effectiveness of the initial guess and the good performance and accuracy of the extraction. The approach has been successfully tested on PDFs simulated from known nanoparticle clusters as well as on a challenging low-symmetry experimental dataset.

Acknowledgment

The authors thank Chia-Hao Liu and Yunzhe Tao for useful discussions. S.B. and S.J.B. also thank Christopher J. Ackerson and Kirsten Marie Jensen for the synthesis and characterization of cluster samples. This research is supported by the U.S. National Science Foundation (NSF) through grant DMREF-1534910. S.B. acknowledges support from the National Defense Science and Engineering Graduate Fellowship (DOD-NDSEG) program. Data collection at the Advanced Photon Source at Argonne National Laboratory was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences (DOE-BES), under contract number DE-AC02-06CH11357.

Bibliography

[1] [BanerjeeImprovedModelsMetallic2018a] Banerjee, S., Liu, C.-H., Lee, J. D., Kovyakh, A., Grasmik, V., Prymak, O., Koenigsmann, C., Liu, H., Wang, L., Abeykoon, A. M. M., Wong, S. S., Epple, M., Murray, C. B. and Billinge, S. J. L. (2018). J. Phys. Chem. C, 122(51), 29498–29506.
[2] [billinge2004beyond] Billinge, S. J. and Kanatzidis, M. (2004). Chemical communications, (7), 749–760.
[3] [billinge2007problem] Billinge, S. J. and Levin, I. (2007). Science, 316(5824), 561–565.
[4] [billi;b;lsfd98] Billinge, S. J. L. (1998). In Local Structure from Diffraction, edited by S. J. L. Billinge and M. F. Thorpe, p. 137. New York: Plenum.
[5] [duxbu;4or16] Billinge, S. J. L., Duxbury, P. M., Gonçalves, D. S., Lavor, C. and Mucherino, A. (2016). 4OR-Q J Oper Res, 14, 337–376.
[6] [billi;4or18] Billinge, S. J. L., Duxbury, P. M., Gonçalves, D. S., Lavor, C. and Mucherino, A. (2018). Ann. Oper. Res. pp. 1–43.
[7] [cervellino2006efficient] Cervellino, A., Giannini, C. and Guagliardi, A. (2006). Journal of computational chemistry, 27(9), 995–1008.
[8] [cliff;prl10] Cliffe, M. J., Dove, M. T., Drabold, D. A. and Goodwin, A. L. (2010). Phys. Rev. Lett. 104(12), 125501.
