Shades of Dark Uncertainty and Consensus Value for the Newtonian Constant of Gravitation

Christos Merkatas; Blaza Toman; Antonio Possolo; Stephan Schlamminger

arXiv:1905.09551·physics.data-an·September 4, 2019


TL;DR

This paper introduces a novel method for deriving a consensus value of the gravitational constant by modeling measurement inconsistencies with latent clusters and shades of dark uncertainty, improving the interpretation of conflicting experimental data.

Contribution

A new procedure for consensus building that models measurement results using latent clusters with different shades of dark uncertainty, tailored to each measurement's placement and reported uncertainties.

Findings

1. Derived a new estimate for G: 6.67408 × 10^{-11} m^3 kg^{-1} s^{-2}
2. Proposed a mixture model accounting for measurement inconsistencies and dark uncertainty
3. Demonstrated the method with measurements of the gravitational constant

Abstract

The Newtonian constant of gravitation, $G$ , stands out in the landscape of the most common fundamental constants owing to its surprisingly large relative uncertainty, which is attributable mostly to the dispersion of the values measured for it in different experiments. This study focuses on a set of measurements of $G$ that are mutually inconsistent, in the sense that the dispersion of the measured values is significantly larger than what their reported uncertainties suggest that it should be. Furthermore, there is a loosely defined group of measured values that lie fairly close to a consensus value that may be derived from all the measurement results, and then there are one or more groups with measured values farther away from the consensus value, some higher, others lower. This same general pattern is often observed in many interlaboratory studies and meta-analyses. In the…

Tables

Table 1: Measurement results for $G$ used in this study. The top fourteen lines reproduce the entries in Mohr et al. [46, Table XV], except for JILA-10, which has meanwhile been corrected as described in the text. The bottom two lines contain the results reported by Li et al. [39], obtained using the time-of-swing (TOS) method and the angular-acceleration-feedback (AAF) method for the torsion pendulum, which have the smallest associated uncertainties achieved thus far [67].

Experiment	$G / (10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}})$	$u(G) / (10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}})$	Reference
NIST-82	$6.672 48$	$0.000 43$	[41]
TR&D-96	$6.6729$	$0.000 50$	[32]
LANL-97	$6.673 98$	$0.000 70$	[2]
UWash-00	$6.674 255$	$0.000 092$	[25]
BIPM-01	$6.675 59$	$0.000 27$	[59]
UWup-02	$6.674 22$	$0.000 98$	[34]
MSL-03	$6.673 87$	$0.000 27$	[1]
HUST-05	$6.672 22$	$0.000 87$	[40, 28]
UZur-06	$6.674 25$	$0.000 12$	[68]
HUST-09	$6.673 49$	$0.000 18$	[74]
JILA-10	$6.672 60$	$0.000 25$	[52]
BIPM-14	$6.675 54$	$0.000 16$	[57, 58]
LENS-14	$6.671 91$	$0.000 99$	[62]
UCI-14	$6.674 35$	$0.000 13$	[49]
HUST-TOS-18	$6.674 184$	$0.000 078$	[39]
HUST-AAF-18	$6.674 484$	$0.000 078$	[39]

Table 2: Normalized residuals computed according to the conventional definition (left panel), and involving the correct denominator (right panel).

Experiment	$r$	Experiment	$r$	Experiment	$r^{*}$	Experiment	$r^{*}$
NIST-82	$- 4.2$	UZur-06	$- 0.32$	NIST-82	$- 4.2$	UZur-06	$- 0.34$
TR&D-96	$- 2.8$	HUST-09	$- 4.4$	TR&D-96	$- 2.8$	HUST-09	$- 4.5$
LANL-97	$- 0.44$	JILA-10	$- 6.8$	LANL-97	$- 0.44$	JILA-10	$- 6.8$
UWash-00	$- 0.37$	BIPM-14	$7.8$	UWash-00	$- 0.40$	BIPM-14	$8.0$
BIPM-01	$4.8$	LENS-14	$- 2.4$	BIPM-01	$4.9$	LENS-14	$- 2.4$
UWup-02	$- 0.070$	UCI-14	$0.47$	UWup-02	$- 0.070$	UCI-14	$0.49$
MSL-03	$- 1.6$	HUST-TOS-18	$- 1.3$	MSL-03	$- 1.6$	HUST-TOS-18	$- 1.5$
HUST-05	$- 2.4$	HUST-AAF-18	$2.5$	HUST-05	$- 2.4$	HUST-AAF-18	$2.9$

Table 3: Values of the LOO Bayesian model selection criterion (LOOIC), for mixture models with $K=0,1,\dots,16$ latent clusters, fitted to the measurements listed in Table 1. The column corresponding to $K=0$ pertains to the common mean model, which does not recognize dark uncertainty. The best model has $K=2$ latent clusters, even if this suggestion is clouded by appreciable uncertainty, $u(\text{LOOIC})$.

$K$	0	1	2	3	4	5	6
LOOIC	$-26.7$	$-170.8$	$-173.9$	$-172.8$	$-172.0$	$-172.2$	$-171.6$
$u(\text{LOOIC})$	74.0	5.2	8.0	7.7	7.9	7.9	8.0
$K$		7	8	9	10	11	12
LOOIC		$-171.2$	$-171.6$	$-171.2$	$-171.2$	$-171.2$	$-171.1$
$u(\text{LOOIC})$		8.2	8.1	8.2	8.3	8.3	8.3
$K$		13	14	15	16
LOOIC		$-170.9$	$-170.6$	$-170.7$	$-170.6$
$u(\text{LOOIC})$		8.5	8.5	8.5	8.5

Table 4: Consensus values, standard uncertainties, and 95 % coverage intervals (Lwr95, Upr95) for $G$, and estimates of shades of dark uncertainty ($\tau$, or $\tau_{1}$ and $\tau_{2}$ for BMM) produced by different statistical models and methods of data reduction. BRE = Birge’s approach with inflation factor that makes Cochran’s $Q$ equal to its expected value. BRM = Birge’s approach with inflation factor equal to its maximum likelihood estimate. BRQ = Birge’s approach with smallest inflation factor that makes data consistent according to Cochran’s $Q$ test. MTE = Weighted average after expansion of standard uncertainties to achieve normalized residuals with absolute value less than 2. DL = DerSimonian-Laird with Knapp-Hartung adjustment. ML = Gaussian maximum likelihood. MP = Mandel-Paule. REML = Restricted Gaussian maximum likelihood. STU = Random effects are a sample from a Student’s $t$ distribution. BG = Bayesian hierarchical model from the NIST Consensus Builder [37], with estimate of $\tau$ set to the median of its posterior distribution. LAP = Laboratory effects and measurement errors modeled as samples from Laplace distributions [66]. MM = Bayesian model using a non-informative Gaussian prior distribution for $G$ and a uniform prior distribution for the dark uncertainty, implemented in R function uvmeta defined in package metamisc [16]. LOP = Linear opinion pool from the NIST Consensus Builder [37].

Method	$G / \gamma$	$u(G) / \gamma$	$\text{Lwr95} / \gamma$	$\text{Upr95} / \gamma$	$\tau / \gamma$
multiplicative model — birge’s approach
BRE	6.67429	0.00014
BRM	6.67429	0.00013	6.67403	6.67455
BRQ	6.67429	0.00011
MTE	6.67429	0.00015
additive model — conventional
DL	6.67399	0.00025	6.67346	6.67453	0.00056
ML	6.67390	0.00025	6.67341	6.67439	0.00091
MP	6.67380	0.00060	6.67263	6.67497	0.00235
REML	6.67389	0.00026	6.67339	6.67440	0.00095
STU	6.67390		6.67335	6.67440	0.00091
additive model — bayesian
BG	6.67389	0.00027	6.67333	6.67442	0.00095
LAP	6.67408	0.00030	6.67345	6.67471	0.00127
MM	6.67389	0.00029	6.67327	6.67443	0.00101
mixture model					$\tau_{1} / \gamma$	$\tau_{2} / \gamma$
BMM	6.67408	0.00024	6.67350	6.67440	0.0004	0.0011
LOP	6.67377	0.00117	6.67127	6.67577

Equations

$$G_{j} = G + \lambda_{j} + \varepsilon_{j}.$$

$$\upsilon_{j}^{2} = u^{2}(G_{j}) + \sum_{k=1}^{K} \pi_{j,k}\,\tau_{k}^{2},$$

$$\text{LOO} = \sum_{j=1}^{n} \log q_{-j,K}(D_{j} \mid D_{-j}),$$


Full text

Shades of Dark Uncertainty and Consensus Value for the Newtonian Constant of Gravitation

Christos Merkatas, Blaza Toman, Antonio Possolo, Stephan Schlamminger

National Institute of Standards and Technology, Gaithersburg, MD, USA

(May 23, 2019)

Abstract

The Newtonian constant of gravitation, $G$ , stands out in the landscape of the most common fundamental constants owing to its surprisingly large relative uncertainty, which is attributable mostly to the dispersion of the values measured for it by different methods and in different experiments, each of which may have rather small relative uncertainty.

This study focuses on a set of measurements of $G$ comprising results published very recently as well as older results, some of which have been corrected since the original publication. This set is inconsistent, in the sense that the dispersion of the measured values is significantly larger than what their reported uncertainties suggest that it should be. Furthermore, there is a loosely defined group of measured values that lie fairly close to a consensus value that may reasonably be derived from all the measurement results, and then there are one or more groups with measured values farther away from the consensus value, some appreciably higher, others lower.

This same general pattern is often observed in many other interlaboratory studies and meta-analyses. In the conventional treatments of such data, the mutual inconsistency is addressed by inflating the reported uncertainties, either multiplicatively, or by the addition of “random effects”, both reflecting the presence of dark uncertainty. The former approach is often used by CODATA and by the Particle Data Group, and the latter is common in medical meta-analysis and in metrology. However, both achieve consistency ignoring how the measured values are arranged relative to the consensus value, and measured values close to the consensus value often tend to be penalized excessively, by such “extra” uncertainty.

We propose a new procedure for consensus building that models the results using latent clusters with different shades of dark uncertainty, which assigns a customized amount of dark uncertainty to each measured value, as a mixture of those shades, and does so taking into account both the placement of the measured values relative to the consensus value, and the reported uncertainties. We demonstrate this procedure by deriving a new estimate for $G$, as a consensus value $G = 6.674\,08 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$, with $u(G) = 0.000\,24 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$.

Keywords: measurement uncertainty, Bayesian, Birge ratio, adjustment, CODATA, random effects, mixture model, Markov Chain Monte Carlo, homogeneity, dark uncertainty

1 Introduction

• Gravitatem in corpora universa fieri, eamque proportionalem esse quantitati materiæ in singulis

• Si Globorum duorum in se mutuò gravitantium materia undique, in regionibus quæ à centris æqualiter distant, homogenea sit: erit pondus Globi alterutrius in alterum reciprocè ut quadratum distantiæ inter centra

I. Newton (1687) — Matheseos Professore Lucasiano

Philosophiæ Naturalis Principia Mathematica

Liber Tertius: De Mundi Systemate

The NIST Reference on Constants, Units, and Uncertainty (https://physics.nist.gov/cuu/Constants/) includes a list of 22 “Frequently used constants”, among them the Newtonian constant of gravitation, $G$ , which has the largest relative standard uncertainty among these 22, by far, particularly after the values of the Planck constant $h$ , elementary electrical charge $e$ , Boltzmann constant $k$ , and Avogadro constant $N_{\text{A}}$ were fixed in preparation for the redefinition of the international system of units (SI) [48]. The constant $G$ appears as a proportionality factor in Newton’s law of universal gravitation and in the field equations of Einstein’s general theory of relativity [44].

The surprisingly large uncertainty associated with $G$ is mostly an expression of the dispersion of the values that have been measured for it, which exceeds by far the reported uncertainties associated with the individual measured values. Rothleitner and Schlamminger [64] suggest that “this gives reason to suspect hidden systematic errors in some of the experiments. An alternative explanation is that although the values are reported correctly, some of the reported uncertainties may be lacking significant contributions. The uncertainty budgets can include only what experimenters know and not what they do not know. This missing uncertainty is sometimes referred to as a dark uncertainty” [73].

Speake [69] summarizes the role that $G$ plays in classical and quantum physics, reviews the methods used to measure $G$ in laboratory experiments, and discusses the outstanding challenges facing such measurements, suggesting that improvements in the measurement of length are key to reducing the uncertainty associated with $G$ , but also, somewhat discouragingly, suggesting that, owing to “a multitude of subtle problems”, it may be a forlorn hope ever to achieve mutual agreement to within 10 parts per million.

Klein [33] offers Modified Newtonian Dynamics (MOND) [43] as a striking and provocative explanation for why some measured values should lie as far away from the currently accepted consensus value [46] as they do, and shows how they can be “corrected.”

The principal aim of this contribution is to present a new approach to derive a consensus value from the set of mutually inconsistent measurement results for $G$ that the Task Group on Fundamental Constants of the Committee on Data for Science and Technology (CODATA, International Council for Science) used to produce its most recent recommendation of a value for $G$ [46], together with the two, more recent measurement results reported by Li et al. [39]. The procedure we propose is equally applicable to similar reductions of other, mutually inconsistent data sets obtained in interlaboratory comparisons and in meta-analyses [14, 3].

In Section 2 we review a few, particularly noteworthy measurements that directly or indirectly relate to $G$ , beginning with the measurement of the density of the Earth undertaken by Henry Cavendish. Section 3 is focused on the evaluation of mutual consistency (or, homogeneity) of the measurement results: we review several ways in which mutual consistency has traditionally been gauged, and discuss how multiplicative and additive statistical models may be used to produce consensus values when the measurement results are mutually inconsistent.

Section 4 addresses a common complaint about the use of models where dark uncertainty appears as a uniform penalty that applies equally to all measurement results being combined into the consensus value, regardless of whether the corresponding measured values lie close or far from the consensus value, and motivates an alternative approach.

This novel approach regards the measured values as drawings from a mixture of probability distributions, effectively clustering the measurements into subsets with different levels (shades) of dark uncertainty (Section 5). If $n$ denotes the number of measurements one wishes to blend, then we consider mixtures whose number of components ranges from 1 to $n$ , and use Bayesian model selection criteria to identify the best model. Section 6 presents the results obtained by application of the proposed model to the measurement results available for $G$ .

The conclusions, presented in Section 7, include the observation that advances in the measurement of $G$ involve not only substantive developments in measurement methods, but also in the statistical modeling that informs productive data reductions and enables realistic uncertainty evaluations.

2 Historical retrospective

Cavendish [11, Page 520] lists 29 determinations of the relative density (or, specific gravity) $d_{\oplus}$ of the Earth. The first 6, produced in the experiments of August 5-7, 1797, were made using one particular wire to suspend the wooden arm of the apparatus bearing two small leaden balls. Cavendish found that this wire was insufficiently stiff, and he replaced it with a stiffer wire for the 23 determinations between August 12, 1797 and May 30, 1798 [11, Page 485].

This second group of 23 determinations has average $5.480\,\mathrm{g/cm^{3}}$. Cavendish points out that the range of these determinations is $0.75\,\mathrm{g/cm^{3}}$, “so that the extreme results do not differ from the mean by more than 0.38, or $\frac{1}{14}$ of the whole, and therefore the density should seem to be determined hereby, to great exactness” [11, Page 521]. Following a brief recapitulation of sources of error discussed amply earlier in the paper, Cavendish [11, Page 522] concludes that “it seems very unlikely that the density of the earth should differ from 5.48 by so much as $\frac{1}{14}$ of the whole.”

Since the standard deviation of those 23 determinations is $0.19\,\mathrm{g/cm^{3}}$, the aforementioned $0.38\,\mathrm{g/cm^{3}}$ (“$\frac{1}{14}$ of the whole”) may be regarded as an expanded uncertainty for approximate 95 % coverage. (This also agrees with the conventional, crude estimate of the standard deviation as one fourth of the range, that is $(0.75\,\mathrm{g/cm^{3}})/4 \approx 0.19\,\mathrm{g/cm^{3}}$ in this case [27].)

In other words, Cavendish seems effectively to have regarded the 23 determinations made using the second wire as a sample from the distribution of the measurand, and used an assessment of their dispersion as evaluation of what nowadays we would call standard uncertainty, rather than using anything like the standard deviation of the average of the same 23 determinations, which would have been $\sqrt{23}\approx 4.8$ times smaller than $0.19\,\mathrm{g/cm^{3}}$. (It should be noted that none of the terms probable error [5], mean error [21], standard deviation [54], or standard error [78] were in use at the time.)
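The arithmetic in the preceding paragraphs can be retraced with a few lines of Python. This is only an illustrative sketch: it uses the summary figures quoted in the text (23 determinations, mean 5.48, standard deviation 0.19, range 0.75, all in g/cm^3) rather than Cavendish's individual determinations, and the variable names are ours.

```python
import math

# Summary figures quoted in the text (g/cm^3); the individual
# determinations are not reproduced here.
n = 23              # determinations made with the stiffer wire
mean_density = 5.48
sd = 0.19           # standard deviation of the 23 determinations
value_range = 0.75  # largest minus smallest determination

expanded = 2 * sd                  # approximate 95 % expanded uncertainty
range_rule_sd = value_range / 4    # crude estimate: one fourth of the range
sd_of_mean = sd / math.sqrt(n)     # what a modern "standard error" would be

print(f"expanded uncertainty: {expanded:.2f} g/cm^3")
print(f"as a fraction of the mean: {expanded / mean_density:.3f} (1/14 = {1 / 14:.3f})")
print(f"range / 4: {range_rule_sd:.2f} g/cm^3")
print(f"sd of the average: {sd_of_mean:.3f} g/cm^3 (sqrt(23) = {math.sqrt(n):.1f})")
```

The last line makes the contrast explicit: the uncertainty Cavendish quoted is roughly ten times the standard deviation of the average.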

The mass $m$ of an object lying on the surface of the ellipsoid defined in the World Geodetic System (WGS 84) [51], at geodetic latitude $\varphi$, satisfies $mg(\varphi)=GM_{\oplus}m/r^{2}(\varphi)$, where $G$ is the Newtonian constant of gravitation, $M_{\oplus}$ denotes the mass of the Earth, $g(\varphi)$ denotes the theoretical acceleration due to gravity (exclusive of the effect of the centrifugal acceleration due to the Earth’s rotation), and $r(\varphi)$ denotes the Earth’s geocentric radius at latitude $\varphi$. If $R_{3} = 6\,371\,000.79\,\mathrm{m}$ denotes the radius of a sphere with the same volume as the WGS 84 ellipsoid [47], then $M_{\oplus}=(4/3)\pi R_{3}^{3}d_{\oplus}$. Therefore, $G=3g(\varphi)r^{2}(\varphi)/(4\pi R_{3}^{3}d_{\oplus})$.

Substituting $d_{\oplus} = 5480\,\mathrm{kg/m^{3}}$ as measured by Henry Cavendish, and $g(\varphi) = 9.812\,004\,\mathrm{m/s^{2}}$ and $r(\varphi) = 6\,365\,097\,\mathrm{m}$ for the latitude, $\varphi = 51.4578^{\circ}\,\text{N}$, of Clapham Common, South London, where his laboratory was located, and neglecting the elevation above sea level of the same location (approximately $30\,\mathrm{m}$), yields $G_{\text{C}} = 6.696\,93 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$. (Note that the subscript “C” that is used here serves only to indicate the provenance of this estimate of $G$, not to suggest that the true value of the constant depends on location.) The foregoing value for $g(\varphi)$ was computed according to NIMA [51, Equation (4-1)], and the radius $r(\varphi)$ was computed using the lengths of the semi-major and semi-minor axis of the WGS 84 ellipsoid listed in NIMA [51, Tables 3-1, 3-3].

Since $u(d_{\oplus})/d_{\oplus}=3.5\,\%$ and we take the geometry of WGS 84, and the latitude of Clapham Common, as known quantities, this is also the relative uncertainty associated with $G_{\text{C}}$. More impressive still is the fact that the error in $G_{\text{C}}$, relative to the CODATA 2014 recommended value, $G_{2014} = 6.674\,08 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$ [46], is only 0.34 %. The comparable relative “errors” associated with the contemporary measured values listed in Table 1 range from $-0.033\,\%$ to $0.023\,\%$, indicating that, in the intervening 220 years, the worst relative “error” in the determination of $G$ has been reduced by no more than 10-fold.
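The conversion from Cavendish's density to a value of $G$, and the comparison with the CODATA 2014 value, can be retraced as follows. This is a minimal sketch that takes the inputs quoted in the text as given (it does not recompute $g(\varphi)$ or $r(\varphi)$ from the WGS 84 parameters); the variable names are ours.

```python
import math

# Inputs quoted in the text
g_phi = 9.812004       # m/s^2, theoretical gravity at Clapham Common
r_phi = 6365097.0      # m, geocentric radius at that latitude
R3 = 6371000.79        # m, radius of sphere with the ellipsoid's volume
d_earth = 5480.0       # kg/m^3, Cavendish's mean density of the Earth
G_2014 = 6.67408e-11   # m^3 kg^-1 s^-2, CODATA 2014 recommended value

# G = 3 g r^2 / (4 pi R3^3 d), from m g = G M m / r^2 and M = (4/3) pi R3^3 d
G_C = 3 * g_phi * r_phi**2 / (4 * math.pi * R3**3 * d_earth)
rel_err_C = (G_C - G_2014) / G_2014

# Measured values from Table 1, in units of 1e-11 m^3 kg^-1 s^-2
table1 = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387, 6.67222,
          6.67425, 6.67349, 6.67260, 6.67554, 6.67191, 6.67435, 6.674184, 6.674484]
rel_errs = [(g * 1e-11 - G_2014) / G_2014 for g in table1]

print(f"G_C = {G_C:.5e} m^3 kg^-1 s^-2")
print(f"relative error of G_C: {100 * rel_err_C:.2f} %")
print(f"Table 1 relative errors: {100 * min(rel_errs):.3f} % to {100 * max(rel_errs):.3f} %")
print(f"ratio of worst errors: {abs(rel_err_C) / max(abs(e) for e in rel_errs):.1f}")
```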

$G$ was of no concern to Cavendish, and neither did Newton introduce it in the Principia [50]. More than 70 years would have to elapse after Cavendish “weighed the Earth”, before even a particular symbol would be advanced for the gravitational constant — and the symbol at first was “ $f$ ”, not “ $G$ ” [15].

According to Hartmut Petzold (formerly with the Deutsches Museum, Munich, personal communication), the birthday of the expression “gravitational constant” was on one of these three days, December 16-18, 1884: on December 16th, Arthur König and Franz Richarz submitted a handwritten proposal to measure “the mean density of the earth”; two days later Helmholtz presented their proposal to the Royal Prussian Academy of Sciences in Berlin with the modified title “A new method for determining the gravitational constant” [61].

In the evening session of June 8, 1894, of the Royal Institution of Great Britain, Charles Vernon Boys also used the symbol $G$ when he made a presentation on the Newtonian constant of gravitation, and announced $6.6576 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$ as “adopted result” derived from experiments using gold and lead balls in a torsion balance [7]. The relative difference between this determination and CODATA’s $G_{2014}$ is $-0.25\,\%$.

In this study we focus on the set of measurement results listed in Table 1, which includes the results that CODATA used to produce the 2014 recommended value for $G$, together with two more recent determinations. Since some of these results differ from their originally published versions, the following remarks clarify the precise provenance of all the measurement results listed. For the sake of brevity, we use the scale factor $\gamma = 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$.

NIST-82

The result published originally, $G/\gamma=6.6726\pm 0.0005$ [41], had not been corrected for an effect caused by the anelasticity of the torsion fiber. The corresponding result listed in Table 1 reflects an anelasticity correction applied by CODATA. It should be noted that the change in the reported uncertainty (down from 0.0005 in the original publication, to the 0.00043 in Table 1) is not a consequence of this correction but results from a refinement of the uncertainty analysis that the authors did between the time when the result was first published and when it was used for the 1986 adjustment of the fundamental physical constants [13].

TR&D-96

Identical to the published measurement result [32].

LANL-97

In 2010, CODATA corrected the result published originally, $G/\gamma=6.6740\pm 0.0007$ [2] to take into account uncertainties in the measurement of the quality factor of the torsion pendulum. The quality factor is needed to calculate the correction caused by the anelastic properties of the fiber.

UWash-00

The measured value listed in the original work [25], $G/\gamma=6.674\,215\pm 0.000\,092$, is lower, by $6\times 10^{-6}$ in relative terms, than the value used by CODATA. After the result was published, the authors noticed the omission of a small effect and communicated a corrected value to CODATA. The small effect was caused by a mass (the pre-hanger) that is mounted on the top of the torsion fiber, and is itself suspended by a thicker fiber. In this experiment, the gravitational torque is counteracted by the inertia of the pendulum in an accelerated rotating frame. The acceleration acts also on the pre-hanger, and its effect must be taken into account. No erratum is publicly available.

BIPM-01

Identical to the published result [59].

UWup-02

Identical to the published result [34].

MSL-03

Identical to the published result [1].

HUST-05

The measurement result published originally in 1999, $G/\gamma = 6.6699 \pm 0.0007$ [40], differs appreciably from the corresponding result used by CODATA. The measured value is lower than its CODATA counterpart, with a relative difference of $3.5\times 10^{-4}$. However, two needed corrections had not been applied: first, for the gravitational effect of the air that is displaced by the field masses; second, for the density inhomogeneity of the source masses. The result, as updated in 2005, became $G/\gamma=6.672\,3\pm 0.000\,9$ [28], where the updated measured value is larger than CODATA’s, the relative difference being $1.1\times 10^{-5}$. In 2014, CODATA applied a third correction for the anelasticity of the fiber.

UZur-06

Identical to the published result [68].

HUST-09

Identical to the published result [74].

JILA-10

The authors of the original work [52], which listed $G/\gamma = 6.672\,34 \pm 0.000\,14$ as measurement result, realized that two effects had been miscalculated. In 2018, they sent an erratum to CODATA reporting a corrected value of $G/\gamma = 6.672\,60 \pm 0.000\,25$. First, the pendulum bob rotates under excursion from the equilibrium position due to a differential stretching of the support wire. The rotation is different in the calibration mode from the measurement mode. The second effect also has to do with the rotation of the bob. If the laser beam is not perfectly centered on the mass centers, a rotation can cause an apparent length change (Abbe effect). The Abbe effect was not properly calculated in the initial publication. These two effects have different signs, yielding a final result that differs relatively only by $+3.9\times 10^{-5}$ from the value in the original publication. An erratum has been submitted for publication in Physical Review Letters.

BIPM-14

The measurement result reported originally, $G/\gamma = 6.675\,45 \pm 0.000\,18$ [57], was superseded by the result listed in an erratum published in 2014 [58]. This was the value used by CODATA. The relative change in value of $+1.35\times 10^{-5}$ (from 6.675 45 to the 6.675 54 listed in Table 1) was caused by the density inhomogeneity of the source masses. In the original publication, the corresponding correction had inadvertently been applied twice.

LENS-14

Identical to the published result [62].

UCI-14

In the original publication [49], the authors reported a slightly smaller value (by $3\times 10^{-6}$ in relative terms), $G/\gamma=6.674\,33\pm 0.000\,13$. The reported value is an average of three measurements. The authors used an unweighted average, while CODATA used a weighted average and considered the correlation between the three results.

HUST-TOS-18

Identical to the published result [39].

HUST-AAF-18

Identical to the published result [39].

3 Mutual consistency

A set of measurement results, comprising pairs of measured values and associated standard uncertainties, for example $\{(G_{j},u(G_{j}))\}$ as in Table 1, is said to be mutually consistent (or, homogeneous) when the variability of the measured values is statistically comparable to the reported uncertainties: for example, when the standard deviation of the $\{G_{j}\}$ is practically indistinguishable from the “typical” $\{u(G_{j})\}$ (say, their median).

The standard deviation of the $\{G_{j}\}$ in Table 1 is $0.001\,09 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$, while the median of the $\{u(G_{j})\}$ is $0.000\,26 \times 10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$: the former is 4.2 times larger than the latter, indicating that the measured values are much more dispersed than their associated uncertainties suggest they should be.
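This comparison is easy to reproduce from Table 1. The sketch below uses only Python's standard `statistics` module; the two lists are the measured values and standard uncertainties from Table 1, expressed in units of $10^{-11}\,\mathrm{m^{3}\,kg^{-1}\,s^{-2}}$, and the variable names are ours.

```python
import statistics

# Table 1, in units of 1e-11 m^3 kg^-1 s^-2 (same order as the table rows)
G = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387, 6.67222,
     6.67425, 6.67349, 6.67260, 6.67554, 6.67191, 6.67435, 6.674184, 6.674484]
u = [0.00043, 0.00050, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027, 0.00087,
     0.00012, 0.00018, 0.00025, 0.00016, 0.00099, 0.00013, 0.000078, 0.000078]

sd_G = statistics.stdev(G)      # sample standard deviation of measured values
median_u = statistics.median(u) # "typical" reported standard uncertainty
ratio = sd_G / median_u

print(f"sd of measured values: {sd_G:.5f}")
print(f"median reported uncertainty: {median_u:.5f}")
print(f"ratio: {ratio:.1f}")
```

A ratio near 1 would indicate mutual consistency in the informal sense described above; here the dispersion is several times larger than the typical reported uncertainty.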

This implies that either the different experiments are measuring different measurands, or there are sources of uncertainty yet unrecognized that are not expressed in the reported uncertainties. If the different experiments indeed are measuring the same measurand, then these uncertainties are much too small, and the lurking, yet unrecognized “extra” component is what Thompson and Ellison [73] felicitously have dubbed dark uncertainty because it is perceived only once independent results are inter-compared. Dark uncertainty may derive from a single or from multiple sources of uncertainty.

Cochran’s $Q$ test, which is the conventional chi-squared test of mutual consistency, is very widely used, even if it suffers from important limitations and misunderstandings [26]. For the measurement results in Table 1, the test statistic is $Q=198$ on 15 degrees of freedom: since the reference distribution is chi-squared with 15 degrees of freedom, $\chi_{15}^{2}$ , the $p$ -value of the test is essentially zero, hence the conclusion of heterogeneity.
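Cochran's $Q$ and its $p$-value can be sketched with the standard library alone. The helper `chi2_sf_odd` below is our own illustrative function, not part of any library: it evaluates the upper tail of a chi-squared distribution with an odd number of degrees of freedom using the closed form that follows from the recurrence of the regularized incomplete gamma function. A statistics package would normally be used instead.

```python
import math

# Table 1, in units of 1e-11 m^3 kg^-1 s^-2
G = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387, 6.67222,
     6.67425, 6.67349, 6.67260, 6.67554, 6.67191, 6.67435, 6.674184, 6.674484]
u = [0.00043, 0.00050, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027, 0.00087,
     0.00012, 0.00018, 0.00025, 0.00016, 0.00099, 0.00013, 0.000078, 0.000078]

def chi2_sf_odd(x, dof):
    """Upper-tail probability of chi-squared with an odd number of dof."""
    if dof % 2 != 1:
        raise ValueError("this helper handles odd degrees of freedom only")
    half = x / 2.0
    tail = math.erfc(math.sqrt(half))  # exact tail for dof = 1
    for j in range(1, (dof - 1) // 2 + 1):
        tail += half ** (j - 0.5) * math.exp(-half) / math.gamma(j + 0.5)
    return tail

w = [1.0 / s ** 2 for s in u]                              # inverse-variance weights
G_bar = sum(wi * gi for wi, gi in zip(w, G)) / sum(w)      # weighted average
Q = sum(wi * (gi - G_bar) ** 2 for wi, gi in zip(w, G))    # Cochran's Q
dof = len(G) - 1
p_value = chi2_sf_odd(Q, dof)

print(f"weighted average: {G_bar:.5f}")
print(f"Q = {Q:.0f} on {dof} degrees of freedom")
print(f"p-value = {p_value:.1e}")
```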

3.1 Multiplicative models

Birge [6] suggested an approach for the combination of mutually inconsistent measurement results that involves: first, inflating the reported standard uncertainties using a multiplicative inflation factor $\kappa$ sufficiently large to make the results mutually consistent; second, combining the measured values into a weighted average whose weights are inversely proportional to the squared uncertainties. (Note that the value of the inflation factor does not affect the value of the estimate of $G$ , only its associated uncertainty.)

The inflation factor is commonly set equal to the Birge Ratio, $\kappa=R_{\text{B}}=\big[\sum_{j=1}^{n}w_{j}(G_{j}-\overline{G})^{2}/(n-1)\big]^{1/2}=3.6$, where $n=16$ denotes the number of measurement results and $\overline{G}$ denotes their weighted average corresponding to weights $\{w_{j}=1/u^{2}(G_{j})\}$. This choice of value for $\kappa$ makes Cochran’s statistic equal to its expected value, hence is a method-of-moments estimate. Birge’s approach is used routinely by the Particle Data Group (pdg.lbl.gov) [71], and also by CODATA to produce recommended values for some of the fundamental physical constants [46], $G$ in particular.
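The Birge ratio and the resulting inflated uncertainty of the weighted average can be computed directly from Table 1. The sketch below follows the formula in the text; the inflated uncertainty is taken to be $R_{\text{B}}$ times the usual standard uncertainty of the weighted average, $(\sum_j w_j)^{-1/2}$, which is our reading of Birge's approach and should correspond to row BRE of Table 4.

```python
import math

# Table 1, in units of 1e-11 m^3 kg^-1 s^-2
G = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387, 6.67222,
     6.67425, 6.67349, 6.67260, 6.67554, 6.67191, 6.67435, 6.674184, 6.674484]
u = [0.00043, 0.00050, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027, 0.00087,
     0.00012, 0.00018, 0.00025, 0.00016, 0.00099, 0.00013, 0.000078, 0.000078]

n = len(G)
w = [1.0 / s ** 2 for s in u]
G_bar = sum(wi * gi for wi, gi in zip(w, G)) / sum(w)
Q = sum(wi * (gi - G_bar) ** 2 for wi, gi in zip(w, G))

R_B = math.sqrt(Q / (n - 1))          # Birge ratio
u_plain = 1.0 / math.sqrt(sum(w))     # uncertainty if results were consistent
u_inflated = R_B * u_plain            # uncertainty after multiplicative inflation

# Inflating every u(G_j) by R_B leaves the weighted average unchanged
# and brings Q down to its expected value, n - 1.
Q_inflated = Q / R_B ** 2

print(f"weighted average: {G_bar:.5f}")
print(f"Birge ratio: {R_B:.1f}")
print(f"u(G): {u_plain:.6f} before inflation, {u_inflated:.5f} after")
print(f"Q after inflation: {Q_inflated:.1f}")
```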

The inflation factor $\kappa$ may be determined in many other ways. For example, as the smallest multiplier for the $\{u(G_{j})\}$ that yields a value of the chi-squared statistic as large as possible yet shy of the critical value for the test. For the data in Table 1, and for a test whose probability of Type I error is 0.05 (the probability of incorrectly rejecting the hypothesis of homogeneity), the critical value is 24.996 (the 95th percentile of $\chi_{15}^{2}$), and the corresponding, smallest inflation factor that achieves homogeneity is $\kappa=2.813$.
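Since multiplying every $u(G_{j})$ by $\kappa$ divides Cochran's $Q$ by $\kappa^{2}$, this smallest inflation factor is $\sqrt{Q/c}$, where $c$ is the critical value. The sketch below finds $c$ by bisection on a closed-form chi-squared tail probability; `chi2_sf_odd` is our own helper for odd degrees of freedom, used here only to keep the example free of external dependencies.

```python
import math

# Table 1, in units of 1e-11 m^3 kg^-1 s^-2
G = [6.67248, 6.6729, 6.67398, 6.674255, 6.67559, 6.67422, 6.67387, 6.67222,
     6.67425, 6.67349, 6.67260, 6.67554, 6.67191, 6.67435, 6.674184, 6.674484]
u = [0.00043, 0.00050, 0.00070, 0.000092, 0.00027, 0.00098, 0.00027, 0.00087,
     0.00012, 0.00018, 0.00025, 0.00016, 0.00099, 0.00013, 0.000078, 0.000078]

def chi2_sf_odd(x, dof):
    """Upper-tail probability of chi-squared with an odd number of dof."""
    half = x / 2.0
    tail = math.erfc(math.sqrt(half))
    for j in range(1, (dof - 1) // 2 + 1):
        tail += half ** (j - 0.5) * math.exp(-half) / math.gamma(j + 0.5)
    return tail

w = [1.0 / s ** 2 for s in u]
G_bar = sum(wi * gi for wi, gi in zip(w, G)) / sum(w)
Q = sum(wi * (gi - G_bar) ** 2 for wi, gi in zip(w, G))
dof = len(G) - 1
alpha = 0.05

# Bisection for the critical value c with P(chi2_dof > c) = alpha;
# the tail probability decreases as x grows.
lo, hi = float(dof), 100.0
for _ in range(80):
    mid = (lo + hi) / 2.0
    if chi2_sf_odd(mid, dof) > alpha:
        lo = mid
    else:
        hi = mid
critical = (lo + hi) / 2.0

kappa = math.sqrt(Q / critical)  # smallest factor with Q / kappa^2 <= critical

print(f"critical value: {critical:.3f}")
print(f"smallest inflation factor: {kappa:.3f}")
```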

The statistical model underlying the multiplicative adjustment of the uncertainties regards the $j$ th measured value as the true value $G$ plus an error commensurate with $u(G_{j})$ and magnified by $\kappa$ . More precisely, as $G_{j}=G+\kappa\varepsilon_{j}$ , where the $\{\varepsilon_{j}\}$ are modeled as non-observable outcomes of independent Gaussian random variables all with mean 0 but with standard deviations $\{u(G_{j})\}$ . The consequence is that the effective measurement errors $\{\kappa\varepsilon_{j}\}$ are then Gaussian random variables with standard deviations $\{\kappa u(G_{j})\}$ .

The two choices of $\kappa$ reviewed above appear reasonable but are ad hoc (yet another ad hoc choice is discussed in Section 3.2). A principled, and generally preferable alternative is maximum likelihood estimation, whereby the “optimal” consensus value $G$ and inflation factor $\kappa$ maximize a product of Gaussian densities evaluated at the measured values $\{G_{j}\}$ , all with the same mean $G$ , and standard deviations $\{\kappa u(G_{j})\}$ . The idea here is to select values for $G$ and $\kappa$ that render the data “most likely.”

The maximum likelihood estimates derived from the data in Table 1 are $\widehat{G}=6.674\,29\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ and $\widehat{\kappa}=3.5$ . The evaluations of the associated uncertainties, obtained using the parametric statistical bootstrap [20], are $u(\widehat{G})=0.000\,13\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ (Table 4, row BRM), and $u(\widehat{\kappa})=0.6$ . A 95 % coverage interval for $\kappa$ ranges from 2.2 to 4.6, thus suggesting that any estimate of the inflation factor $\kappa$ is bound to be clouded by very substantial uncertainty. Figure 1 depicts the results.

Rothleitner and Schlamminger [64] sound a note of despair at the conclusion of their review of the history, status, and prospects for improvement of the measurements of $G$ : “Given the current situation in the measurement of $G$ , it is difficult to see how our knowledge of $G$ can be improved, for example, $\chi^{2}$ will not decrease by adding new experiments, as it is a sum of squares and can increase only with new data. The Birge ratio can decrease by increasing $\sqrt{N-1}$ in the denominator; however, this will be a slow process. If an additional 13 experiments are performed (which could take another 30 years if past experiments are an indication), $R_{\text{B}}$ can be reduced by a factor 1.4 if the values are close to the current average value. It is equally difficult to see how the multiplicative factor that CODATA used to bring all normalized residuals below two can be decreased. Thus, decreasing the current uncertainty assigned to the recommended value of G does not seem to be possible — at least, not in the foreseeable future.”

Although we agree that reducing the uncertainty associated with $G$ is an outstanding challenge in precision measurement, we believe that conventional metrics for mutual inconsistency, be they Cochran’s $Q$ or the Birge ratio, are not the most informative means to gauge progress or lack thereof, and that more productive avenues for data reduction are available as we shall illustrate forthwith. Furthermore, in Section 3.2 we show that the goal of bringing “all normalized residuals below two” is excessively restrictive, hence ought not to be used as a quality criterion whereon to judge the mutual consistency of any collection of measurement results.

3.2 Normalized residuals

Mohr and Taylor [45] introduce the notion of “normalized residual” in the context of the nonlinear least squares method that CODATA has been using to derive adjusted values of the fundamental constants $z_{1},\dots,z_{M}$ from a collection of measured values $q_{1},\dots,q_{N}$ of quantities that are functionally related to those constants by measurement equations $\{q_{i}=f_{i}(z_{1},\dots,z_{M})\}$ , where the $\{f_{i}\}$ are determined by the laws of physics and $N>M$ . The normalized residual corresponding to $q_{i}$ is $r_{i}=(q_{i}-\widehat{q}_{i})/u_{i}$ , with $u_{i}=u(q_{i})$ the standard uncertainty associated with $q_{i}$ , and $\widehat{q}_{i}=f_{i}(\widehat{z}_{1},\dots,\widehat{z}_{M})$ , where $\widehat{z}_{1},\dots,\widehat{z}_{M}$ denote the adjusted values of the constants.

For the 2014 adjustment of the value of $G$ , “the Task Group decided that it would be more appropriate to follow its usual approach of treating inconsistent data, namely, to choose an expansion factor that reduces each $|r_{i}|$ to less than 2” [46]. The idea is aligned with Birge’s approach, that $u_{i}$ should be replaced by $\kappa u_{i}$ , where $\kappa$ is the aforementioned expansion factor, thereby reducing the magnitude of the residuals. The adjusted value is the weighted average of the $\{G_{j}\}$ , with weights proportional to $1/(\kappa u_{j})^{2}$ , where $u_{j}=u(G_{j})$ .

Applying this same procedure to the measurement results listed in Table 1 yields an expansion factor $\kappa=3.9$ . The corresponding estimate of $G$ is $6.674\,29\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , with associated standard uncertainty $0.000\,15\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ (Table 4, row MTE).

This approach is justified by the belief that the $\{r_{i}\}$ should be approximately like a sample from a Gaussian distribution, which Figure 2 indeed supports. There are, however, two issues with this approach to achieve mutual consistency.

First, and this is the minor issue, the denominator of $r_{j}=(G_{j}-\overline{G})/u_{j}$ should be $u(G_{j}-\overline{G})$ , not $u(G_{j})$ , because the latter does not recognize the uncertainty associated with $\overline{G}$ or the correlation between $G_{j}$ and $\overline{G}$ . Table 2 lists the values of the normalized residuals $\{r_{j}\}$ as defined conventionally, and their counterparts $\{r_{j}^{\ast}=(G_{j}-\overline{G})/u(G_{j}-\overline{G})\}$ involving the correct denominator (evaluated using the parametric statistical bootstrap [20]). The differences between corresponding values indeed are minor and largely inconsequential in this case.

Second, and this is the major issue, if the (properly) normalized residuals indeed are like a sample of size $n$ from a Gaussian distribution with mean 0 and standard deviation 1, then, according to the Fisher-Tippett-Gnedenko theorem [24], the expected value of the largest residual is approximately equal to $(1-\gamma)\Phi^{-1}(1-1/n)+\gamma\Phi^{-1}(1-1/(en))$ , where $\Phi^{-1}$ denotes the quantile function (inverse of the probability distribution function) of the Gaussian distribution with mean 0 and standard deviation 1, $e\approx 2.718\,282$ is Euler’s number, and $\gamma\approx 0.577\,215\,7$ is the Euler-Mascheroni constant. This expected value increases with $n$ , and it is already 1.8 for $n=16$ .
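The approximation above is easy to evaluate with the Python standard library. In this sketch the function name `expected_max_std_normal` is illustrative, and `NormalDist().inv_cdf` plays the role of $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_std_normal(n):
    """Approximate expected value of the largest of n standard Gaussian residuals."""
    quantile = NormalDist().inv_cdf  # quantile function of the standard Gaussian
    return ((1 - EULER_GAMMA) * quantile(1 - 1 / n)
            + EULER_GAMMA * quantile(1 - 1 / (math.e * n)))

approx_16 = expected_max_std_normal(16)  # about 1.8, as stated in the text
```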

Furthermore, when there are $n=16$ normalized residuals, and the data are mutually consistent and the underlying statistical model applies, the odds are better than even (53 % probability, in fact) that at least one will have absolute value greater than 2. Therefore, and in general, requiring that all normalized residuals, after application of the expansion factor, should have absolute values less than 2 leads to excessively large expansion factors.
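The probability just quoted follows from the independence of the residuals: the chance that none of $n$ standard Gaussian residuals exceeds 2 in absolute value is the $n$th power of the chance for a single residual. A minimal sketch, with an illustrative function name:

```python
from statistics import NormalDist

def prob_any_exceeds(n, threshold=2.0):
    """Probability that at least one of n independent standard Gaussian
    residuals has absolute value greater than the threshold."""
    p_single = 2 * (1 - NormalDist().cdf(threshold))  # two-sided tail probability
    return 1 - (1 - p_single) ** n

p_16 = prob_any_exceeds(16)  # roughly 0.53
```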

3.3 Additive models

An alternative treatment, which indeed is the most prevalent approach to blend independent measurement results, from medicine to metrology, including when the results are mutually inconsistent, involves an additive model for the measured values, of the form

$$G_{j}=G+\lambda_{j}+\varepsilon_{j}\qquad(1)$$

This model acknowledges the possibility that the different experiments may be measuring different quantities, by introducing experiment effects $\{\lambda_{j}\}$ such that, given $\lambda_{j}$ , the expected value of $G_{j}$ is $G+\lambda_{j}$ . The standard deviation of the measurement error $\varepsilon_{j}$ is the reported uncertainty, $u(G_{j})$ . Since the experiment effects may be indistinguishable from zero, this model can also accommodate mutually consistent data.

On first impression, it may seem that the model is non-identifiable: that by making $\lambda_{j}$ large and $\varepsilon_{j}$ small, or vice-versa, the same value of $G_{j}$ may be reproduced. However, the fact that the data are not only the $\{G_{j}\}$ but also the $\{u(G_{j})\}$ , resolves the potential ambiguity: since the $\{\varepsilon_{j}\}$ are comparable to their corresponding $\{u(G_{j})\}$ , if the $\{G_{j}\}$ turn out to be appreciably more dispersed than the $\{u(G_{j})\}$ intimate, then this suggests that the $\{\lambda_{j}\}$ cannot all be zero.

The most common modeling assumption is that the $\{\lambda_{j}\}$ are a sample from a Gaussian distribution with mean 0 and standard deviation $\tau$ , which quantifies the dark uncertainty. Koepke et al. [36] discuss several variants of this random effects model, and describe procedures to fit them to measurement data. Some of these procedures are implemented in the NIST Consensus Builder, which is a Web-based application publicly and freely available at https://consensus.nist.gov [37].

The DerSimonian-Laird procedure to fit random effects models to measurement data is used most commonly in meta-analysis in medicine [18, 19]. This procedure yields the conventional weighted mean when the estimate of dark uncertainty is 0.
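The following is a minimal sketch of the standard DerSimonian-Laird moment estimator, written to show why the procedure reduces to the conventional weighted mean when the estimate of dark uncertainty is 0. It omits the Knapp-Hartung adjustment and is not the implementation used in the NIST Consensus Builder. The function name and the demonstration data are hypothetical.

```python
import math

def dersimonian_laird(values, uncertainties):
    """Return (consensus value, standard uncertainty, tau) for a random effects model."""
    n = len(values)
    w = [1.0 / u**2 for u in uncertainties]
    sum_w = sum(w)
    weighted_mean = sum(wi * y for wi, y in zip(w, values)) / sum_w
    q = sum(wi * (y - weighted_mean) ** 2 for wi, y in zip(w, values))  # Cochran's Q
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (n - 1)) / c)  # moment estimate, truncated at zero
    # Each weight now includes the dark uncertainty; with tau2 == 0 these are the
    # original weights, so the consensus value is the conventional weighted mean.
    w_star = [1.0 / (u**2 + tau2) for u in uncertainties]
    consensus = sum(wi * y for wi, y in zip(w_star, values)) / sum(w_star)
    return consensus, math.sqrt(1.0 / sum(w_star)), math.sqrt(tau2)

# Hypothetical, mutually inconsistent results (NOT the data of Table 1)
consensus_demo, u_demo, tau_demo = dersimonian_laird([0.0, 10.0], [1.0, 1.0])
```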

The version of the DerSimonian-Laird procedure implemented in the NIST Consensus Builder estimates $G$ as $6.673\,99\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , with associated standard uncertainty $u(G)=0.000\,25\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ (including the Knapp-Hartung adjustment [35]), and dark uncertainty $\tau_{\text{DL}}=0.000\,56\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ . These results are depicted in Figure 3, and appear in Table 4, row DL.

Rukhin and Possolo [66] propose a version of the random effects model where both the experiment effects $\{\lambda_{j}\}$ and the measurement errors $\{\varepsilon_{j}\}$ are samples from two different Laplace distributions. The consensus value in this case is a weighted median, $6.674\,08\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , with associated standard uncertainty $0.000\,30\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ . The corresponding estimate of dark uncertainty is $\tau_{\text{LAP}}=0.001\,27\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ (Table 4, row LAP).

Several other versions of the additive random effects model are implemented in various packages for the R environment for statistical data analysis and graphics [60], including: metafor [76] (used to produce the estimates of $G$ labeled ML, MP, and REML in Table 4); metaplus [4] (for estimate STU in Table 4); and metamisc [16] (for estimate MM in Table 4), among many others.

4 Shades of Dark Uncertainty

A comparison of Figures 1 and 3, and of the underlying models and corresponding numerical results, reveals important and obvious differences, as well as two noteworthy commonalities: (i) the consensus values, although numerically different, neither differ significantly from one another once their associated uncertainties are taken into account, nor do they differ significantly from the 2014 CODATA recommended value, even though both incorporate measurement results (HUST-TOS-18 and HUST-AAF-18) that were not yet available when this recommended value was produced, as well as the corrected result for JILA-10; (ii) both penalize the effective uncertainty of the individual measurement results uniformly, albeit one differently from the other.

The penalty applies regardless of how the measured values are situated relative to the consensus value, and regardless also of whether the reported uncertainties are small or large. For example, in Figure 3 one might have expected JILA-10 and BIPM-14 to have been penalized with appreciably larger components of dark uncertainty than UZur-06 or HUST-09.

Figure 1 reveals other, possibly even less palatable anomalies, which are specific to the multiplicative inflation of the reported uncertainties: in particular, that the results from LANL-97, UWup-02, HUST-05, and LENS-14 end up contributing essentially nothing to the consensus value.

The pattern of the measurement results depicted in Figure 3 is fairly typical: on the one hand, there is a cluster of results (including UZur-06 and HUST-TOS-18) that, all by themselves, would be mutually consistent and indeed have measured values that lie quite close to the consensus value; on the other hand, there is another cluster (including BIPM-01, JILA-10 and BIPM-14) whose measured values lie much farther afield, to either side of the consensus value.

To increase the flexibility of additive random effects models, in particular to enable them to cope with such a mixed bag of results, and to alleviate the inequities arising from applying the same dark uncertainty penalty to all the results, regardless of how they are situated relative to the consensus value, we have developed a new model that yields different evaluations of dark uncertainty for different subsets of the measurement results. We call the corresponding, different $\tau$ s, shades of dark uncertainty.

This new model, which we introduce in the next Section, represents the probability distributions of the measured values as mixtures of distributions, similarly to how the linear opinion pool, implemented in the NIST Consensus Builder [36], represents them. (The results of applying the linear opinion pool to the data in Table 1 are labeled LOP in Table 4.)

For a simple example of a mixture, consider two dice: one is cubic with faces numbered 1 through 6; the other is dodecahedral with faces numbered 1 through 12; the faces of each die are equally likely to land up when the die is rolled. Suppose that one die is chosen at random so that the cubic die is twice as likely to be chosen as the dodecahedral die, and then it is rolled. The probability distribution of the outcome is a mixture of two discrete, uniform distributions: the probability of a four is $(2/3)\times(1/6)+(1/3)\times(1/12)=5/36$ .

And if one is told that a four turned up, but not which die was rolled, then one can use Bayes rule [17, 56] to infer that it was the cubic die with $((1/6)\times(2/3))/(5/36)=$ 80 % probability. Given the results of multiple realizations of this procedure (choosing a die at random and rolling this die), one may then compute the probabilities of the outcomes having originated in the cubic die. Those outcomes for which this probability is greater than 50 % may be said to form one cluster, and the others a different cluster.
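The two-dice example can be reproduced with exact rational arithmetic. The sketch below uses illustrative names (`prior`, `likelihood`, `marginal`, `posterior`) and recovers the mixture probability $5/36$ and the posterior probability $4/5$ quoted above.

```python
from fractions import Fraction

# Probability of choosing each die, and the number of faces of each die
prior = {"cubic": Fraction(2, 3), "dodecahedral": Fraction(1, 3)}
faces = {"cubic": 6, "dodecahedral": 12}

def likelihood(die, outcome):
    """Probability that the given die shows the given outcome."""
    return Fraction(1, faces[die]) if 1 <= outcome <= faces[die] else Fraction(0)

def marginal(outcome):
    """Mixture probability of the outcome, averaging over the choice of die."""
    return sum(prior[die] * likelihood(die, outcome) for die in prior)

def posterior(die, outcome):
    """Bayes rule: probability that the outcome came from the given die."""
    return prior[die] * likelihood(die, outcome) / marginal(outcome)

p_four = marginal(4)                   # Fraction(5, 36)
p_cubic_given_four = posterior("cubic", 4)  # Fraction(4, 5)

# Outcomes whose posterior probability for the cubic die exceeds 1/2 form one cluster
cubic_cluster = [x for x in range(1, 13) if posterior("cubic", x) > Fraction(1, 2)]
```

Here the outcomes 1 through 6 fall in the cluster of the cubic die, and 7 through 12 can only have come from the dodecahedral die.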

5 Bayesian mixture model

The mixture model that we propose is parametric and Bayesian, and depends on the number, $K$ , of shades of dark uncertainty to be entertained. “Parametric” means that all probability distributions are determined by a finite number of scalar parameters. “Bayesian” means that the data ( $\{(G_{j},u(G_{j}))\}$ ) are modeled as observed values of random variables, that the unknowns (true value of $G$ , probabilities of membership in the latent clusters, and shades of dark uncertainty) are modeled as non-observable random variables, and that the information the data hold about the unknowns is extracted by application of Bayes’s rule and distilled into the posterior distribution of the unknowns (which is the conditional distribution of the parameters given the data).

Subsection 5.1 characterizes the model given the number, $K$ , of components in the mixture, and Subsection 5.2 describes how a value for $K$ is chosen automatically, from among the models corresponding to $K=1,2,\dots,n$ , so that the procedure produces the “best” model, according to a Bayesian model selection criterion.

5.1 Model definition

Mixture models do not actually partition the measured values into clusters, each with its own shade of dark uncertainty. Instead, each measured value belongs to all the latent clusters simultaneously, but typically with rather different probabilities of belonging to each one of them. This fuzzy reality notwithstanding, it is often a useful simplification to say that a measured value belongs to the latent cluster that it has the largest posterior probability of belonging to. Accordingly, and to present the results vividly, in Section 6 we “assign” each measurement to the latent cluster that the measurement has the largest posterior probability of belonging to — the so-called maximum a posteriori estimate (MAP) of cluster membership.
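The MAP "assignment" amounts to picking, for each measurement, the cluster with the largest posterior membership probability. A minimal sketch follows. The result labels and probabilities are hypothetical and do not come from the fitted model.

```python
def map_cluster(probabilities):
    """Return the 1-based index of the cluster with the largest membership probability."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k]) + 1

# Hypothetical posterior membership probabilities for three results and K = 2 clusters
membership = {
    "result A": [0.80, 0.20],
    "result B": [0.45, 0.55],
    "result C": [0.51, 0.49],  # nearly a toss-up, yet still "assigned" to cluster 1
}
assignments = {name: map_cluster(p) for name, p in membership.items()}
```

The third entry shows what the simplification hides: a hard label can rest on a very narrow margin.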

The $K$ distributions being mixed (which define the latent clusters) are Gaussian, and they have different standard deviations, which are the shades of dark uncertainty, $\tau_{1},\dots,\tau_{K}$ . The results include an estimate of $G$ , an evaluation of the associated uncertainty, estimates of the $\{\tau_{k}\}$ , as well as the identification of the latent cluster that each measurement result most likely belongs to.

Since the model is Bayesian and will be fit to the measurement results via Markov Chain Monte Carlo (MCMC) [23], not only estimates and standard uncertainties, but also coverage (credible) intervals, may easily be derived for all the parameters in the model: $G$ , the $\{\tau_{k}\}$ , and the cluster membership probabilities $\bm{\pi}_{j}=(\pi_{j,1},\dots,\pi_{j,K})$ , where $\pi_{j,K}=1-(\pi_{j,1}+\dots+\pi_{j,K-1})$ , for $j=1,\dots,n$ , and $\pi_{j,k}$ denotes the probability that measurement $j$ belongs to cluster $k$ , for $k=1,\dots,K$ . Therefore, the model corresponding to a particular value of $K$ has $1+K+n(K-1)$ parameters.
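The parameter count is one for $G$, $K$ for the shades of dark uncertainty, and $K-1$ free membership probabilities for each of the $n$ measurements. A short sketch, with an illustrative function name:

```python
def n_parameters(n, k):
    """Number of free parameters: G, the k shades tau_1..tau_k, and k - 1
    free membership probabilities for each of the n measurements."""
    return 1 + k + n * (k - 1)

count_k2 = n_parameters(16, 2)  # 19 for n = 16 measurements and K = 2
count_k1 = n_parameters(16, 1)  # 2: only G and a single tau
```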

The reported uncertainties $\{u(G_{j})\}$ , even though they are data, are treated as known quantities on the assumption that they are based on infinitely many degrees of freedom. In cases where they are not, the model can easily be modified to accommodate the finite numbers of degrees of freedom that the $\{u(G_{j})\}$ may be based on.

The model is hierarchical [22]: (i) given $G$ , the $\{\tau_{k}\}$ , and the $\{\bm{\pi}_{j}\}$ , the measured values are modeled as observed outcomes of Gaussian random variables, with $G_{j}$ having a Gaussian distribution with mean $G$ and standard deviation $\upsilon_{j}$ such that

$$\upsilon_{j}^{2}=u^{2}(G_{j})+\sum_{k=1}^{K}\pi_{j,k}\tau_{k}^{2}\qquad(2)$$

for $j=1,\dots,n$ ; (ii) $G$ has an essentially non-informative Gaussian prior distribution with mean $G_{2014}$ and large variance; (iii) the $\{\tau_{k}\}$ have mildly informative half-Cauchy distributions whose medians have to be specified; and (iv) the $\{\bm{\pi}_{j}\}$ have the same flat Dirichlet prior distribution (all concentration parameters set equal to 1) [38, Chapter 49]. Furthermore, $G$ , the $\{\tau_{k}\}$ , and the $\{\bm{\pi}_{j}\}$ are mutually independent a priori. Equation (2) makes precise the sense in which the effective dark uncertainty for each measurement result is a mixture of shades of dark uncertainty.

We implemented this model in the JAGS language [55], and then used the implementation in R function jags defined in package R2jags [70], to produce samples from the distribution of all the parameters via MCMC.

5.2 Model selection

Since the mixture representation of the dark uncertainty that appears in the second term on the right-hand side of Equation (1) involves latent clusters and not a partition of the measurements into actual clusters, in principle there is no constraint on the number, $K$ , of latent clusters. However, common sense dictates that there ought not to be more than the number, $n$ , of measurements being combined, hence $1\leqslant K\leqslant n$ .

We consider the $n$ models corresponding to $K=1,\dots,n$ in turn, and use each one to predict the value of $G$ that a future, independent experiment may produce. Then we choose the model that makes the most accurate predictions. To be able to explain how this is done, even if we omit all of the technical details, we need to introduce some notation.

Let $D$ denote the data in hand ( $n$ measured values and their associated uncertainties), and $\theta$ denote the parameters in the model defined in Subsection 5.1, with $K$ latent clusters. Therefore, $\theta$ includes the unknown value of the Newtonian constant of gravitation, $G$ , the shades of dark uncertainty $\{\tau_{k}\}$ , and the probabilities, $\{\pi_{j,k}\}$ , of membership in the latent clusters. The probability density of the data given the parameters is $f_{K}(D|\theta)$ , and $p(\theta)$ is the prior probability density of the parameters. The density of the posterior distribution of the parameters given the data is given by Bayes’s rule [17]: $q_{K}(\theta|D)=f_{K}(D|\theta)p(\theta)/g_{K}(D)$ , where $g_{K}(D)=\int f_{K}(D|\theta)p(\theta)\mathrm{d}\theta$ , and the integral is over the set of possible values of the parameters.

Our goal is to select the value of $K$ for which $h_{K}(D^{\ast}|D)$ is largest, where $D^{\ast}$ denotes a future measurement, and $h_{K}$ is the predictive posterior density defined as $h_{K}(D^{\ast}|D)=\int f_{K}(D^{\ast}|\theta)q_{K}(\theta|D)\mathrm{d}\theta$ [23]. Since this future observation $D^{\ast}$ is speculative (hence, unknown), the best we can do is estimate $h_{K}(D^{\ast}|D)$ pretending that $D^{\ast}$ is one of the results that we have, and that $D$ comprises all the results that we have except that one.

For model selection, we rely on the Bayesian Leave-One-Out cross validation score (LOO), which gauges the posterior predictive acumen of the model under consideration. To compute it, the model is fitted to $D_{-j}$ (all the measurements except the $j$ th), and the corresponding predictive density is evaluated at $D_{j}$ (the measurement left out, here playing the role of future, independent measurement), this process being repeated for $j=1,\dots,n$ . Thus, for each number of latent clusters $K$ , the model is fitted $n$ times, producing $n$ posterior densities $q_{-1,K},\dots,q_{-n,K}$ , each based on $n-1$ measurements, and $\log h_{K}(D^{\ast}|D)$ is estimated by the cross-validated predictive accuracy score

[TABLE]

which we then transform into the LOO Information Criterion, $\text{LOOIC}=-2\times\text{LOO}$ , which is numerically comparable to Akaike’s Information Criterion (AIC), a widely used model selection criterion [8].

Since determining each $q_{-j,K}$ involves an MCMC run, the procedure outlined in the previous paragraph requires $nK$ MCMC runs. However, R package loo [75] offers a shortcut to this onerous procedure and produces an approximation to the foregoing average of values of log posterior densities using the results of a single MCMC run.

Since the LOOIC involves the data and MCMC sampling, it is surrounded by uncertainty, which we have evaluated using R function loo defined in the package of the same name. In general, the smaller the LOOIC, the better the model. However, differences between values of LOOIC have to be interpreted taking their associated uncertainties into account, as we will explain in Section 6.

5.3 Similar models

There is a growing collection of models whose purpose and devices are similar to the model we described above. Here we mention only a few of these alternatives.

Burr and Doss [10] describe a Bayesian semi-parametric model for random-effects meta-analysis in the form of a Dirichlet mixture, which is implemented in R package bspmma [9].

Jara et al. [31] present Bayesian non-parametric and semi-parametric models for a wide range of applications, including for linear, mixed-effects models used in meta-analysis, using a Dirichlet process prior distribution, or a mixture of Dirichlet process prior distributions [72], for the distribution of the random effects. Both R packages DPpackage [30, 31] and dirichletprocess [63] facilitate the use of these priors.

Jagan and Forbes [29] propose adjusting (typically inflating) each reported uncertainty just enough to achieve mutual consistency, with the adjustments obtained by minimization of a relative entropy criterion. The results may be interpreted as involving estimates of dark uncertainty that are tailored for each measurement result individually.

Our proposal and that of Rukhin [65] are similar in that they both model the additional uncertainty directly, and not through the distribution of the random effects as is done in most other models. The main differences between our approach and that of Rukhin [65] are the following:

• Our mixture model comprises latent clusters, and each measurement may belong to all the clusters simultaneously, possibly with different probabilities, hence its effective dark uncertainty is a mixture of the shades of dark uncertainty of the latent clusters; Rukhin [65] partitions the measurements into clusters and assigns a particular, same value of dark uncertainty to all the measurements in the same cluster.

• Rukhin [65] assumes that the measurements in one of the clusters are mutually consistent, hence that it has no dark uncertainty (the “null” cluster). In most cases there will be multiple clusters whose measurements are mutually consistent, and the results may depend on which one is chosen to play the role of “null” cluster.

6 Results

Table 3 lists the values of the model selection criterion LOOIC, and associated uncertainties, for the models corresponding to $K=0,1,\dots,16$ shades of dark uncertainty. The case with $K=0$ is the common mean model, $G_{j}=G+\varepsilon_{j}$ (cf. Equation (1)), which does not recognize dark uncertainty, and is vastly inferior to the models that do recognize it.

As $K$ increases from 1 to $n$ , the LOOIC undergoes its largest drop in value from $K=1$ to $K=2$ , where it reaches its minimum, thus suggesting that the best model should have $K=2$ latent clusters. However, the large uncertainties associated with the LOOIC caution that this choice is only nominally better than any other.

One of the reasons why the LOOIC does not achieve a sharp, deep minimum, and instead keeps hovering near its minimum as $K$ increases above 2, is that for some of the larger values of $K$ , the number of effective latent clusters is much smaller than $K$ . For example, when $K=10$ , there are only 5 different MAP estimates of cluster “membership”, that is, 5 different effective latent clusters. Next we explain what we mean by “effective latent clusters.”

In Subsection 5.1 we pointed out that we “assign” each measurement to the latent cluster that the measurement has the largest posterior probability of belonging to — the so-called maximum a posteriori estimate (MAP) of cluster membership: these MAP assignments are reflected in the different colors of the labels in Figures 5 and 6.

Recognizing that the model with $K=2$ , although nominally the best, is not head and shoulders above the other models with $K\geqslant 1$ , we further invoke the general principle that, everything else being just about comparable, one is well-advised to take the simpler model: therefore, we will proceed on the assumption that the best model has $K=2$ latent clusters. This choice is also supported by the fact that the model with $K=2$ assigns clearly smaller amounts of dark uncertainty to UWash-00 and to UZur-06 than to results that are similarly precise, or even more precise, but lie farther away from the consensus value. The model with $K=1$ would be incapable of drawing such distinctions.

The MCMC procedure yielded a sample of size $512\,000$ drawn from the joint posterior distribution of the parameters, resulting from collating every 25th outcome from each of four chains of length $4\times 10^{6}$ , with burn-in of $8\times 10^{5}$ iterations per chain. Each point in this sample comprises one value for $G$ , values for $\tau_{1}$ and $\tau_{2}$ , cluster memberships $C_{1},\dots,C_{n}$ , and cluster membership probabilities $\pi_{1,1},\pi_{1,2}=1-\pi_{1,1}$ , $\dots$ , $\pi_{n,1},\pi_{n,2}=1-\pi_{n,1}$ for all the measurements in Table 1.
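The stated sample size follows from the chain settings, as this short check shows (variable names are illustrative):

```python
chains = 4
chain_length = 4_000_000   # iterations per chain
burn_in = 800_000          # iterations discarded at the start of each chain
thinning = 25              # keep every 25th outcome

retained_per_chain = (chain_length - burn_in) // thinning  # 128_000
total_sample_size = chains * retained_per_chain            # 512_000
```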

The upper panel of Figure 4 depicts the posterior distribution of $G$ . The Bayesian estimate of the consensus value was chosen as the mean of the sample drawn from the posterior distribution of $G$ , $6.674\,08\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , and the associated standard uncertainty, $0.000\,24\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , as the standard deviation of the same sample. The 2.5th and 97.5th percentiles of this sample are the endpoints of a 95 % coverage (credible) interval for the true value of $G$ ; their values are listed in the row of Table 4 labeled BMM.
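The three summaries described here (mean, standard deviation, and the 2.5th and 97.5th percentiles) can be sketched as follows. The draws are simulated stand-ins generated with a fixed seed, not the paper's MCMC sample, and the function name `summarize` is hypothetical.

```python
import random
import statistics

random.seed(1)
# Hypothetical stand-in for posterior draws of G, in units of 1e-11 m^3 kg^-1 s^-2
draws = [random.gauss(6.6741, 0.0002) for _ in range(20000)]

def summarize(sample):
    """Posterior mean, standard deviation, and 95 % credible interval of a sample."""
    cuts = statistics.quantiles(sample, n=40)  # cut points at 2.5 %, 5 %, ..., 97.5 %
    return {
        "estimate": statistics.fmean(sample),
        "std_uncertainty": statistics.stdev(sample),
        "interval95": (cuts[0], cuts[-1]),
    }

summary = summarize(draws)
```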

The lower panel of Figure 4 depicts the posterior distributions of the two shades of dark uncertainty, $\tau_{1}$ and $\tau_{2}$ . Their Bayesian estimates, $\widehat{\tau}_{1}=0.0004\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ and $\widehat{\tau}_{2}=0.0011\times 10^{-11}\,\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}$ , were chosen as the medians of their respective MCMC samples because their distributions are markedly asymmetrical, with very long right tails.

Figure 5 depicts the medians of the posterior probabilities of cluster membership, showing that for only a few of the measurement results (for example, BIPM-14 and UZur-06) is membership in one of the clusters clearly more likely than membership in the other. HUST-AAF-18 is just about as likely to belong to one cluster as to the other, the difference favoring membership in cluster 1 (which has the smaller shade of dark uncertainty) by the narrowest of margins.

This fact helps explain why, as shown in Figure 6, the dark uncertainty assigned to HUST-AAF-18 is closer to the dark uncertainty assigned to NIST-82 than to the dark uncertainty assigned to HUST-TOS-18, even though, on the one hand, the standard uncertainty reported for HUST-AAF-18 is quite similar to the standard uncertainty reported for HUST-TOS-18, and on the other hand HUST-AAF-18 lies much closer to the consensus value than NIST-82. The reason is that cluster membership is determined by the distance to the consensus value gauged in terms of the reported standard uncertainty: from this viewpoint HUST-AAF-18 is just about as far from the consensus value as NIST-82, and so much farther from it than HUST-TOS-18.

Figure 6 depicts the data and the results of fitting our mixture model to them. The meaning of the thick and thin vertical blue lines is similar to the meaning that they have in Figure 3. A word of explanation is in order for how the lengths of the thin lines were determined. The thin line centered at $G_{j}$ represents $G_{j}\pm\upsilon_{j}$ , where $\upsilon_{j}$ was defined in Equation (2). However, this is not how the $\{\upsilon_{j}\}$ were computed.

The approach we took for computing $\upsilon_{j}$ tracks the actual way in which MCMC unfolds, as closely as possible: each time the MCMC process generates an acceptable sample, it provides a cluster membership $k_{j}\in\{1,2\}$ for $G_{j}$ , and produces also values for $\tau_{1}$ and $\tau_{2}$ for the latent clusters: the corresponding value of $\upsilon^{2}_{j}$ is $u^{2}(G_{j})+\tau^{2}_{k_{j}}$ . The value used for $\upsilon_{j}$ in this Figure is the square root of the median of the $512\,000$ samples of $\upsilon^{2}_{j}$ computed as just described, for each $j=1,\dots,n$ .
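The computation just described can be sketched as follows for a single measurement. Each draw is represented as a tuple $(k_{j},\tau_{1},\tau_{2})$. The function name and the three toy draws are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math
import statistics

def effective_uncertainty(u_j, draws):
    """Square root of the median of u_j^2 + tau_{k_j}^2 over the MCMC draws.

    draws: iterable of (k_j, tau_1, tau_2), with cluster membership k_j in {1, 2}.
    """
    squared = [
        u_j**2 + (tau_1 if k_j == 1 else tau_2) ** 2
        for k_j, tau_1, tau_2 in draws
    ]
    return math.sqrt(statistics.median(squared))

# Toy draws: the measurement falls in cluster 1 twice and in cluster 2 once
toy_draws = [(1, 4.0, 12.0), (1, 4.0, 12.0), (2, 4.0, 12.0)]
upsilon = effective_uncertainty(3.0, toy_draws)  # sqrt(median(25, 25, 153)) = 5.0
```

Using the median of the squared values makes the result insensitive to the occasional draw in which the measurement lands in the cluster with the larger shade of dark uncertainty.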

7 Conclusions

The mixture model introduced in Section 5, driven by the model selection criteria outlined in Subsection 5.2, by and large achieved one of the central goals of this contribution: to impart flexibility to the conventional laboratory effects model, in particular addressing successfully the long-standing grievance resulting from penalizing (with dark uncertainty) measured values lying close to the consensus value as severely as those that lie farther afield.

This model truly excels in producing shades of dark uncertainty that are finely attuned to the structure of the data, in particular to how the measured values are arranged relative to one another and relative to the consensus value, while taking their associated uncertainties into account, as Figure 6 shows. Furthermore, the model does all this without widening the uncertainty associated with the consensus value.

Table 4 summarizes consensus values for $G$ , and expressions of the associated uncertainty, that were derived from the set of measurement results listed in Table 1, by various methods described in the foregoing, and in particular by the new method that we have described (denoted BMM in this table).

It should be noted that the four variants of the multiplicative model (BRE, BRM, BRQ, and MTE) all produce slightly larger estimates of $G$ than the additive models and the mixture models. Both LAP [66] and MP [42, 53] yield evaluations of dark uncertainty appreciably larger than the other methods. The Bayesian mixture model (BMM) produces modest shades of dark uncertainty because it capitalizes on smart, “soft” clustering of the measurement results to explain the overall dispersion of the measured values above and beyond what their associated, reported uncertainties suggest.

In this contribution we have argued in favor of model-based approaches to consensus building (as opposed to ad hoc approaches like those that are driven by the Birge ratio or by the sizes of the absolute values of normalized residuals), particularly when faced with measurement results that are mutually inconsistent.

Additive laboratory random effects models using mixtures, like the model we introduced in Section 5, seem especially promising as they are able to identify subsets of results that appear to express different shades of dark uncertainty, and then weigh them differently in the process of consensus building, yet without disregarding the information provided by any of the results being combined.

We have assembled in Table 4 the results produced by an assortment of methods not only to provide perspective on the new method we are proposing (BMM), but also because arguably any of these methods could reasonably be selected by different professional statisticians working in collaboration with physicists engaged in the measurement of $G$ . Even though the underlying models and specific validating assumptions differ, choosing one or another reflects mostly inessential differences in training, preference, and experience in statistical data modeling and analysis.

However, the variability of these estimates of $G$ , which is attributable solely to differences in approach, model selection, and data reduction technique, amounts to about 78 % of the median of the values of $u(G)$ listed in the same table, and to about 20 % of the median of the values of $\tau$ .

In other words, the statistical “noise” (that is, the vagaries and incidentals that would lead one researcher to opt for a particular model and method of data reduction, and another for a different model and method) is clearly not negligible. Therefore, the development, dissemination, and widespread adoption of best practices in statistical modeling and data analysis for consensus building will be contributing factors in reducing the uncertainty associated with any consensus value that may be derived from an ever growing set of reliable measurement results for $G$ obtained by increasingly varied measurement methods.
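A ratio of the kind quoted above (spread of the consensus estimates relative to the median of the reported uncertainties) can be sketched as follows. The measure of variability is not specified in this passage, so the sample standard deviation used here is an assumption, the function name `noise_ratio` is hypothetical, and the numbers in the demonstration are placeholders, not the entries of Table 4.

```python
import statistics


def noise_ratio(estimates, reference_values):
    """Spread of the estimates divided by the median of the reference values.

    The sample standard deviation is an assumed measure of spread;
    another choice (for example a robust scale estimate) would give
    a different ratio.
    """
    return statistics.stdev(estimates) / statistics.median(reference_values)


# Placeholder values in arbitrary units (NOT taken from Table 4).
estimates_demo = [1.0, 2.0, 3.0]      # consensus values from different methods
u_demo = [2.0, 4.0, 1.0]              # corresponding values of u(G)
print(noise_ratio(estimates_demo, u_demo))
```

Applying the same function with the values of $\tau$ as `reference_values` gives the second kind of ratio mentioned in the text.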

Finally, we express our belief that one feature already apparent in the collection of measurement results assembled in Table 1 provides the greatest hope yet for appreciable progress in the years to come: the fact that fundamentally different, truly independent measurement methods have been employed, relying on different physical laws, and yet there has been convergence toward a believable consensus, even if it still falls short of achieving a reduction in relative uncertainty to levels comparable to what prevails for other fundamental constants.

Acknowledgments

The authors are much indebted to their NIST colleagues Eite Tiesinga (Quantum Measurement Division, Physical Measurement Laboratory, and Joint Quantum Institute, University of Maryland), and Amanda Koepke (Statistical Engineering Division, Information Technology Laboratory), who read an early draft, uncovered errors and pointed out confusing passages, and provided many valuable comments and suggestions for improvement. The authors are very grateful to Hartmut Petzold for allowing them to share results of his historical research prior to publication in a forthcoming book.

Bibliography

1Armstrong and Fitzgerald [2003] T. R. Armstrong and M. P. Fitzgerald. New measurements of $G$ using the measurement standards laboratory torsion balance. Physical Review Letters, 91:201101, November 2003. doi: 10.1103/PhysRevLett.91.201101.
2Bagley and Luther [1997] C. H. Bagley and G. G. Luther. Preliminary results of a determination of the Newtonian constant of gravitation: a test of the Kuroda hypothesis. Physical Review Letters, 78:2047, 1997.
3Baker and Jackson [2013] R. D. Baker and D. Jackson. Meta-analysis inside and outside particle physics: two traditions that should converge? Research Synthesis Methods, 4(2):109–124, 2013. doi: 10.1002/jrsm.1065.
4Beath [2016] K. J. Beath. metaplus: An R package for the analysis of robust meta-analysis and meta-regression. R Journal, 8(1):5–16, 2016. URL https://journal.r-project.org/archive/2016-1/beath.pdf.
5Bessel [1815] F. W. Bessel. Ueber den Ort des Polarsterns. In J. E. Bode, editor, Astronomische Jahrbuch für das Jahr 1818, pages 233–240. Königliche Akademie der Wissenschaften, Berlin, 1815.
6Birge [1932] R. T. Birge. The calculation of errors by the method of least squares. Physical Review, 40:207–227, April 1932. doi: 10.1103/PhysRev.40.207.
7Boys [1894] C. V. Boys. The Newtonian constant of gravitation. Proceedings of the Royal Institution of Great Britain, 14(88):353–377, 1894. Friday, June 8.
8Burnham and Anderson [2004] K. P. Burnham and D. R. Anderson. Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2):261–304, November 2004. doi: 10.1177/0049124104268644.