How did Donald Trump Surprisingly Win the 2016 United States Presidential Election? an Information-Theoretic Perspective (Clean Sensing for Big Data Analytics: Optimal Strategies, Estimation Error Bounds Tighter than the Cramér-Rao Bound)

Weiyu Xu; Lifeng Lai; Amin Khajehnejad

arXiv:1812.11891·cs.IT·January 1, 2019

TL;DR

This paper uses information theory to analyze opinion poll inaccuracies in the 2016 US election and proposes an optimal sensing strategy that balances data quality and quantity under cost constraints.

Contribution

It introduces a general framework for optimal parameter estimation from heterogeneous, distorted data sources, and derives new lower bounds tighter than classical bounds.

Findings

01

Larger sample size does not guarantee better polling accuracy.

02

Optimal resource allocation improves estimation from heterogeneous data.

03

New lower bounds on estimation error are tighter than Cramér-Rao bounds.

Abstract

Donald Trump was lagging behind in nearly all opinion polls leading up to the 2016 US presidential election, but he surprisingly won the election. This raises the following important questions: 1) why were most opinion polls inaccurate in 2016? and 2) how can the accuracy of opinion polls be improved? In this paper, we study the inaccuracies of opinion polls in the 2016 election through the lens of information theory. We first propose a general framework of parameter estimation, called clean sensing (polling), which performs optimal parameter estimation with sensing cost constraints, from heterogeneous and potentially distorted data sources. We then cast the opinion polling as a problem of parameter estimation from potentially distorted heterogeneous data sources, and derive the optimal polling strategy using heterogeneous and possibly distorted data under cost constraints. Our results…

Equations (478)

$$f(X_{1}^{1},X_{2}^{1},\ldots,X_{m_{1}}^{1},X_{1}^{2},X_{2}^{2},\ldots,X_{m_{2}}^{2},\ldots,X_{1}^{K},\ldots,X_{m_{K}}^{K},\theta,\mathcal{A}) = f_{1}^{1}(X_{1}^{1},\theta,\mathcal{A})\, f_{2}^{1}(X_{2}^{1},\theta,\mathcal{A}) \cdots f_{m_{1}}^{1}(X_{m_{1}}^{1},\theta,\mathcal{A})\, f_{1}^{2}(X_{1}^{2},\theta,\mathcal{A}) \cdots f_{m_{2}}^{2}(X_{m_{2}}^{2},\theta,\mathcal{A}) \cdots f_{m_{K}}^{K}(X_{m_{K}}^{K},\theta,\mathcal{A}),$$

$$\theta = [\theta_{1},\theta_{2},\ldots,\theta_{d}]^{T} \in \mathbb{R}^{d}.$$

$$I_{i,j} = E\left[\frac{\partial}{\partial\theta_{i}}\log f(x;\theta)\,\frac{\partial}{\partial\theta_{j}}\log f(x;\theta)\right] = -E\left[\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\log f(x;\theta)\right].$$

$$\frac{\partial}{\partial\theta}\log f(x;\theta)$$

$$\frac{\partial}{\partial\theta}\left[\int g(x) f(x;\theta)\,dx\right] = \int g(x)\left[\frac{\partial}{\partial\theta} f(x;\theta)\right] dx$$

$$\mathrm{cov}_{\theta}(g(X)) \geq \frac{\partial\phi(\theta)}{\partial\theta}\,[I(\theta)]^{-1}\left(\frac{\partial\phi(\theta)}{\partial\theta}\right)^{T},$$

$$\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta), \quad 1\leq k\leq K,$$

$$F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta),$$

$$\frac{\partial\phi(\theta)}{\partial\theta}\,F^{-1}\left(\frac{\partial\phi(\theta)}{\partial\theta}\right)^{T},$$

$$\frac{\partial T(\theta)}{\partial\theta}\,F^{-1}\left(\frac{\partial T(\theta)}{\partial\theta}\right)^{T}.$$

$$\begin{aligned}
\min_{m_{k},\,c_{i}^{k}} \quad & \frac{\partial T(\theta)}{\partial\theta}\,F^{-1}\left(\frac{\partial T(\theta)}{\partial\theta}\right)^{T} \\
\text{subject to} \quad & \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} c_{i}^{k} \leq C, \\
& F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta), \\
& m_{k} \in \mathbb{Z}_{\geq 0}, \\
& c_{i}^{k} \geq 0, \ 1\leq k\leq K, \ 1\leq i\leq m_{k}.
\end{aligned}$$

$$\begin{aligned}
\min_{m_{k},\,c_{i}^{k}} \quad & \int h(\theta)\,\frac{\partial T(\theta)}{\partial\theta}\,F^{-1}(\theta)\left(\frac{\partial T(\theta)}{\partial\theta}\right)^{T} d\theta \\
\text{subject to} \quad & \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} c_{i}^{k} \leq C, \\
& F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta), \\
& m_{k} \in \mathbb{Z}_{\geq 0}, \\
& c_{i}^{k} \geq 0, \ 1\leq k\leq K, \ 1\leq i\leq m_{k}.
\end{aligned}$$

$$\begin{aligned}
\min_{m_{k},\,c_{i}^{k}} \ \max_{\theta\in\Omega} \quad & \frac{\partial T(\theta)}{\partial\theta}\,F^{-1}(\theta)\left(\frac{\partial T(\theta)}{\partial\theta}\right)^{T} \\
\text{subject to} \quad & \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} c_{i}^{k} \leq C, \\
& F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta), \\
& m_{k} \in \mathbb{Z}_{\geq 0}, \\
& c_{i}^{k} \geq 0, \ 1\leq k\leq K, \ 1\leq i\leq m_{k}.
\end{aligned}$$

$$\frac{\partial T(\theta)}{\partial\theta}\,F^{-1}(\theta)\left(\frac{\partial T(\theta)}{\partial\theta}\right)^{T} = f(m_{1},m_{2},\ldots,m_{K},c_{1}^{1},c_{2}^{1},\ldots,c_{m_{1}}^{1},\ldots,c_{m_{K}}^{K})$$

$$\begin{aligned}
\min_{m_{k},\,c_{i}^{k}} \quad & f(m_{1},m_{2},\ldots,m_{K},c_{1}^{1},c_{2}^{1},\ldots,c_{m_{1}}^{1},c_{1}^{2},\ldots,c_{m_{K}}^{K}) \\
\text{subject to} \quad & \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} c_{i}^{k} \leq C, \\
& F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta), \\
& m_{k} \in \mathbb{Z}_{\geq 0}, \\
& c_{i}^{k} \geq 0, \ 1\leq k\leq K, \ 1\leq i\leq m_{k}.
\end{aligned}$$

$$\begin{aligned}
f(X_{1}^{1},X_{2}^{1},\ldots,X_{m_{1}}^{1},X_{1}^{2},X_{2}^{2},\ldots,X_{m_{2}}^{2},\ldots,X_{1}^{K},\ldots,X_{m_{K}}^{K},\theta,\mathcal{A})
= {} & f_{1}^{1}(X_{1}^{1},\theta_{1},\mathcal{A})\, f_{2}^{1}(X_{2}^{1},\theta_{1},\mathcal{A}) \cdots f_{m_{1}}^{1}(X_{m_{1}}^{1},\theta_{1},\mathcal{A}) \\
& \times f_{1}^{2}(X_{1}^{2},\theta_{2},\mathcal{A}) \cdots \times f_{m_{2}}^{2}(X_{m_{2}}^{2},\theta_{2},\mathcal{A}) \cdots \times f_{m_{K}}^{K}(X_{m_{K}}^{K},\theta_{K},\mathcal{A}).
\end{aligned}$$

$$F(\theta) = \sum_{k=1}^{K}\sum_{i=1}^{m_{k}} F^{k}(c_{i}^{k},\theta)$$

$$\frac{1}{C}\left(\sum_{k=1}^{K}\frac{\partial T(\theta)}{\partial\theta_{k}}\,p_{k}^{*}\right)^{2}.$$

$$C_{k}^{*} = \frac{C\,\frac{\partial T(\theta)}{\partial\theta_{k}}\,p_{k}^{*}}{\sum_{k=1}^{K}\frac{\partial T(\theta)}{\partial\theta_{k}}\,p_{k}^{*}}.$$

$$(c_{i}^{k})^{*} = c_{k}^{**}.$$

$$m_{k}^{*} = \left(\frac{C\,\frac{\partial T(\theta)}{\partial\theta_{k}}\,p_{k}^{*}}{\sum_{k=1}^{K}\frac{\partial T(\theta)}{\partial\theta_{k}}\,p_{k}^{*}}\right) \Big/ c_{k}^{**}.$$

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Wireless Communication Security Techniques · Error Correcting Code Techniques

Full text

How did Donald Trump Surprisingly Win the 2016 United States Presidential Election? an Information-Theoretic Perspective (i.e. Clean Sensing for Big Data Analytics: Optimal Sensing Strategies and New Lower Bounds on the Mean-Squared Error of Parameter Estimators which can be Tighter than the Cramér-Rao Bound)

Weiyu Xu (Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242; corresponding email: [email protected]), Lifeng Lai (Department of Electrical and Computer Engineering, University of California, Davis, CA, 95616), Amin Khajehnejad (3Red Trading Group, LLC, Chicago, IL).

Abstract

Donald Trump was lagging behind in nearly all opinion polls leading up to the 2016 United States presidential election of Tuesday, November 8, 2016, but he surprisingly won the presidential election. Due to the significance of the United States presidential elections, this raises the following important questions: 1) why were most opinion polls inaccurate in 2016? and 2) how can the accuracy of opinion polls be improved? In this paper, we study and explain the inaccuracies of opinion polls in the presidential election of 2016 through the lens of information theory. We first propose a general framework of parameter estimation in information science, called clean sensing (polling), which performs optimal parameter estimation with sensing cost constraints, from heterogeneous and potentially distorted data sources. We then cast the opinion polling as a problem of parameter estimation from potentially distorted heterogeneous data sources, and derive the optimal polling strategy using heterogeneous and possibly distorted data under cost constraints. Our results show that a larger number of data samples does not necessarily lead to better polling accuracy, which gives a possible explanation of the inaccuracies of most opinion polls for the 2016 presidential election. The optimal sensing (polling) strategy should instead optimally allocate sensing resources over heterogeneous data sources according to several factors including data quality, and, moreover, for a particular data source, the optimal sensing strategy should strike an optimal balance between the quality of data samples and the quantity of data samples.

As a byproduct of this research, in a general setting beyond the clean sensing problem, we derive a group of new lower bounds on the mean-squared errors of general unbiased and biased parameter estimators. These new lower bounds can be tighter than the classical Cramér-Rao bound (CRB) and Chapman-Robbins bound. Our derivations are via studying the Lagrange dual problems of certain convex programs. The classical Cramér-Rao bound and Chapman-Robbins bound follow naturally from our results for special cases of these convex programs.

Keywords: parameter optimization, the Cramér-Rao bound (CRB), the Chapman-Robbins bound, information theory, polling, heterogeneous data.

1 Introduction

In many areas of science and engineering, we are faced with the task of estimating or inferring certain parameters from heterogeneous data sources. These heterogeneous data sources can have data of different qualities for the task of estimation: the data from certain data sources can be noisier or more distorted than those from other data sources. For example, in sensor networks monitoring trajectories of moving targets, the sensing data from different sensors can have different signal-to-noise ratios, depending on factors such as the distances between the moving target and the sensors, and the precisions of the sensors. As another example, in political polling, the polling data can come from diverse demographic groups, and the polling data from different demographic groups can have different levels of noise and distortion for a particular questionnaire.

Even from within a single data source, one can also obtain data of different qualities for inference, through different sensing modalities of different costs. For example, when we use sensors with higher precisions to sense data from a given data source, we can obtain data of higher quality, but at a higher sensing cost. Moreover, when we try to estimate the parameters of interest, we often operate under sensing cost constraints, namely the total cost spent on obtaining data from heterogeneous data sources cannot exceed a certain threshold.

This raises the natural question: “Under given cost constraints, how do we perform optimal estimation of the parameters of interest, from heterogeneous data?” By “optimal estimation”, we mean minimizing the estimation error in terms of certain performance metrics, such as minimizing the mean squared error.

In this paper, we propose a generic framework to answer the question above, namely to optimally estimate the parameters of interest from heterogeneous data, under certain cost constraints. In particular, we consider how to optimally allocate sensing resources to obtain data of heterogeneous qualities from heterogeneous data sources, to achieve the highest fidelity in parameter estimation.

Our research is partially motivated by the contrast between the actual results of the 2016 United States presidential election of Tuesday, November 8, 2016, and the pre-election polling results, which mostly contradicted the actual outcome. Donald Trump was lagging behind in nearly all opinion polls leading up to the election, but he surprisingly won with 306 electoral votes versus Hillary Clinton’s 232 electoral votes (state-by-state tallies, without accounting for faithless electors). Right before the election in 2016, some polling analysts were very confident that Hillary Clinton would win the US presidency: there was near unanimity among forecasters in predicting a Clinton victory. A notable example is neuroscientist and polling analyst Sam Wang, one of the founders of the Princeton Election Consortium, whose Bayesian model predicted a greater than 99% chance of a Clinton victory [2, 3], as seen in Wang’s election morning blog post titled “Final Projections: Clinton 323 EV, 51 Democratic Senate seats, GOP House” [4, 1]. As an anecdote, Dr. Wang was so confident in his prediction that, before the election, he promised to eat a bug if Donald J. Trump won more than 240 Electoral College votes, a promise he later kept by eating a cricket with honey on CNN [10].

The contrast between most poll predictions and the actual results of the 2016 presidential election was so dramatic that it surprised and puzzled many pollsters. In fact, the actual election results differed clearly, and sometimes dramatically, from the polling results, both nationally and statewise. In the fiercely competitive Midwestern battleground states where the polls predicted that Trump had an advantage, such as Iowa, Ohio, and Missouri, Trump performed better than expected. Trump also won Wisconsin, Michigan and Pennsylvania, which were considered part of the blue firewall. For example, let us consider the final polling average published by Real Clear Politics on November 7, 2016. In Wisconsin, the poll average showed that Clinton had a +6.5% advantage over Trump, while the actual election result showed that Trump had a +0.7% advantage over Clinton; in Michigan, the poll average showed that Clinton had a +3.4% advantage over Trump, while the actual election result showed that Trump had a +0.3% advantage over Clinton; in Pennsylvania, the poll average showed that Clinton had a +1.9% advantage over Trump, while the actual election result had Trump +0.3% on top. In Iowa, the poll average showed that Trump had a +3.9% advantage over Clinton, while the actual election result showed that Trump’s advantage greatly increased to +9.5%; in Missouri, the poll average showed that Trump had a +9.5% advantage over Clinton, while the actual election result showed that Trump’s advantage greatly increased to +18.5%; and in Minnesota, the poll average showed that Clinton had a +6.2% advantage over Trump, while the actual election result showed that Clinton’s advantage shrank significantly to +1.5%.

While Trump outperformed the polling results in most of the states where the polling results clearly differed from the actual voting results, Clinton also outperformed the polling results in a small number of states, such as Nevada, Colorado, and New Mexico. Figure 1, cited from [5], shows the difference between the final polling average published by Real Clear Politics [6] on November 7, 2016, and the final voting results in 16 states. It is worth mentioning that, among dozens of polls, the UPI/CVoter poll and the University of Southern California/Los Angeles Times poll were the only two polls that often predicted a Trump popular vote victory or showed a nearly tied election.
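The size and direction of these misses can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that only reuses the six state margins quoted in the text; the sign convention (positive means Trump ahead) and the names `QUOTED_MARGINS` and `polling_miss` are assumptions introduced here for illustration.

```python
# Margins are in percentage points, signed from Trump's side (an assumed
# convention for this sketch): positive = Trump ahead, negative = Clinton ahead.
# Each entry is (final poll average, actual result) as quoted in the text.
QUOTED_MARGINS = {
    "Wisconsin": (-6.5, 0.7),
    "Michigan": (-3.4, 0.3),
    "Pennsylvania": (-1.9, 0.3),
    "Iowa": (3.9, 9.5),
    "Missouri": (9.5, 18.5),
    "Minnesota": (-6.2, -1.5),
}


def polling_miss(poll_margin: float, actual_margin: float) -> float:
    """Signed miss in points; positive means the polls understated Trump's margin."""
    return round(actual_margin - poll_margin, 1)


misses = {state: polling_miss(p, a) for state, (p, a) in QUOTED_MARGINS.items()}
mean_miss = round(sum(misses.values()) / len(misses), 2)

for state, miss in misses.items():
    print(f"{state:<13} miss = {miss:+.1f} points")
print(f"mean signed miss = {mean_miss:+.2f} points")
```

In this sample every signed miss has the same sign, which is the pattern one would expect from a systematic distortion rather than from zero-mean sampling noise.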

The dramatic and consistent differences between the polling results and the actual election returns, both nationally and statewise, cannot be explained by the “margins of error” of these polls. This indicates that there were significant and systematic errors in the polling results. This contrast was also alarming, considering that election forecasters already had access to big data and had applied advanced big data analytics techniques. So it is imperative to understand why the predictions from polling were so far off.

Due to the significance of the United States presidential elections, this raises the following important questions: 1) why were most opinion polls inaccurate in 2016? and 2) how can the accuracy of opinion polls be improved? While there are many possible explanations for the inaccuracies of the opinion polls for the 2016 presidential election, in this paper, we look at the possibility that the opinion data collected in polling were distorted and noisy, with noise and distortion levels that differed across demographic groups. For example, supporters of a candidate might be embarrassed to tell the truth, and thus more likely to lie in polling, when their friends and/or local/national news media are vocal supporters of the opposing candidate.

In this paper, we study and explain the inaccuracy of most opinion polls through the lens of information theory. We first propose a general framework of parameter estimation in information science, called clean sensing (polling), which performs optimal parameter estimation with sensing cost constraints, from heterogeneous and potentially distorted data sources. We then cast the opinion polling as a problem of parameter estimation from potentially distorted heterogeneous data sources, and derive the optimal polling strategy using heterogeneous and possibly distorted data under cost constraints. Our results show that a larger number of data samples does not necessarily lead to better polling accuracy. The optimal sensing (polling) strategy instead optimally allocates sensing resources over heterogeneous data sources, and, moreover, for a particular data source, the optimal sensing strategy should strike an optimal balance between the quality of data samples and the number of data samples.

As a byproduct of this research, we derive a series of new lower bounds on the mean squared errors of unbiased and biased parameter estimators, in the general setting of parameter estimation. These new lower bounds can be tighter than the classical Cramér-Rao bound (CRB) and Chapman-Robbins bound. Our derivations proceed by studying the Lagrange dual problems of certain convex programs, and the classical Cramér-Rao bound (CRB) and Chapman-Robbins bound follow naturally from our results for special cases of these convex programs.

The rest of this paper is organized as follows. In Section 2, we introduce the problem formulation of parameter estimation using potentially distorted data from heterogeneous data sources, namely the problem of clean sensing. In Section 4, we cast finding the optimal sensing strategies in parameter estimation using heterogeneous data sources as explicit mathematical optimization problems. In Section 5, we derive asymptotically optimal solutions to the optimization problems of finding the optimal sensing strategies for clean sensing. In Section 6, we consider clean sensing for the special case of Gaussian random variables. In Section 7, we cast the problem of opinion polling in political elections as a problem of clean sensing, and give a possible explanation for why the polling for the 2016 presidential election was not accurate. We also derive the optimal polling strategies under potentially distorted data to achieve the smallest polling error. In Section 8, we derive new lower bounds on the mean-squared errors of parameter estimators, which can be tighter than the classical Cramér-Rao bound and Chapman-Robbins bound. Our derivations proceed by solving the Lagrange dual problems of certain convex programs, and are of independent interest.

2 Problem Formulation

In this section, we introduce the problem formulation and model setup for clean sensing (polling). Suppose we want to estimate a parameter $\theta$ (which can be a scalar or a vector), or a function of the parameter, say, $f(\theta)$. We assume that there are $K$ heterogeneous data sources, where $K$ is a positive integer. From each of the heterogeneous data sources, say, the $k$-th data source, we obtain $m_{k}$ samples, where $1\leq k\leq K$. We denote the $m_{k}$ samples from the $k$-th data source as $X_{1}^{k}$, $X_{2}^{k}$, …, $X_{m_{k}}^{k}$, and these samples take values from domain $\mathcal{D}_{k}$. We assume that cost $c^{k}_{i}$ is spent on acquiring the $i$-th sample from the $k$-th data source, where $1\leq k\leq K$ and $1\leq i\leq m_{k}$. We assume that we take action $\mathcal{A}$ in sampling from the $K$ heterogeneous data sources. We assume that under action $\mathcal{A}$, the $m=\sum_{k=1}^{K}m_{k}$ samples $X_{1}^{1}$, $X_{2}^{1}$, …, $X_{m_{1}}^{1}$, $X_{1}^{2}$, $X_{2}^{2}$, …, $X_{m_{2}}^{2}$, …, $X_{1}^{K}$, $X_{2}^{K}$, …, $X_{m_{K}}^{K}$ follow distribution $f(X_{1}^{1},X_{2}^{1},\ldots,X_{m_{1}}^{1},X_{1}^{2},X_{2}^{2},\ldots,X_{m_{2}}^{2},\ldots,X_{1}^{K},X_{2}^{K},\ldots,X_{m_{K}}^{K},\theta,\mathcal{A})$. This distribution depends on the parameter $\theta$ and the action $\mathcal{A}$.

In this paper, for simplicity, we assume that under action $\mathcal{A}$, the samples across data sources are independent, and the samples from a single data source are independent. Hence we can express the distribution as follows:

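$$f(X_{1}^{1},X_{2}^{1},\ldots,X_{m_{1}}^{1},X_{1}^{2},X_{2}^{2},\ldots,X_{m_{2}}^{2},\ldots,X_{1}^{K},\ldots,X_{m_{K}}^{K},\theta,\mathcal{A}) = f_{1}^{1}(X_{1}^{1},\theta,\mathcal{A})\, f_{2}^{1}(X_{2}^{1},\theta,\mathcal{A}) \cdots f_{m_{1}}^{1}(X_{m_{1}}^{1},\theta,\mathcal{A})\, f_{1}^{2}(X_{1}^{2},\theta,\mathcal{A}) \cdots f_{m_{2}}^{2}(X_{m_{2}}^{2},\theta,\mathcal{A}) \cdots f_{m_{K}}^{K}(X_{m_{K}}^{K},\theta,\mathcal{A}),$$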

where $f_{i}^{k}(X_{i}^{k})$ is the probability distribution of $X_{i}^{k}$, namely the $i$-th sample from the $k$-th data source, with $1\leq k\leq K$ and $1\leq i\leq m_{k}$. We can of course also extend the analysis to more general cases where the samples are not independent.

In this paper, we consider the following problem: under a budget on the total cost for sensing (polling), what is the optimal action $\mathcal{A}$ to guarantee the most accurate estimation of $\theta$ or its function? More specifically, determining the sampling action means determining the number of data samples from each data source, and determining the cost spent on obtaining each data sample from each data source. We consider the non-sequential setting, where the sampling action is predetermined before the actual sampling action happens. In this paper, we use the mean-squared error to quantify the accuracy of the estimation of $\theta$ or a function of $\theta$ . We can also extend this work to use other performance metrics than the mean-squared error, such as those concerning the distribution of the estimation error or the tail bound on the estimation error.
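To make the decision variables concrete, a non-sequential sampling action can be represented as nothing more than a table of per-sample costs. The sketch below is one possible representation and is not taken from the paper: the class name `SamplingAction`, its methods, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SamplingAction:
    """Hypothetical container for a predetermined sampling plan.

    costs[k][i] plays the role of c_i^k, the cost spent on the i-th sample
    from the k-th data source, so len(costs[k]) plays the role of m_k.
    """

    costs: list[list[float]] = field(default_factory=list)

    def sample_counts(self) -> list[int]:
        """Return [m_1, ..., m_K]."""
        return [len(source_costs) for source_costs in self.costs]

    def total_cost(self) -> float:
        """Return the sum of c_i^k over all sources k and samples i."""
        return sum(sum(source_costs) for source_costs in self.costs)

    def is_feasible(self, budget: float) -> bool:
        """Check nonnegative costs and the total budget constraint."""
        nonnegative = all(c >= 0 for source in self.costs for c in source)
        return nonnegative and self.total_cost() <= budget


# Illustrative plan: three cheap samples from source 1, two costly ones from source 2.
action = SamplingAction(costs=[[1.0, 1.0, 1.0], [2.5, 2.5]])
print(action.sample_counts(), action.total_cost(), action.is_feasible(budget=10.0))
```

Keeping the costs per sample, rather than one cost per source, mirrors the formulation above, in which each $c_{i}^{k}$ is a separate decision variable.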

3 Related Works

In [9], the authors considered controlled sensing for sequential hypothesis testing with the freedom of selecting sensing actions with different costs for best detection performance. Compared with [9], our work is different in three aspects: 1) in [9], the authors considered a sequential hypothesis testing problem, while in this paper, we consider a parameter estimation problem; 2) in [9], the authors worked with samples from a single data source (possibly having different distributions under different sampling actions), while in our paper, we consider heterogeneous data sources where samples follow distributions determined not only by the sampling actions but also by the types of data sources; 3) in our paper, we consider continuously-valued sampling actions whose costs can take continuous values, compared with discretely-valued sampling actions with discretely-valued costs in [9].

In [11], the authors considered designs of experiments for sequential estimation of a function of several parameters, using data from heterogeneous sources (types of experiments), with a budget constraint on the total number of samples. In [11], each data sample from each data source always requires a unit cost to obtain, and the observer does not have the freedom of controlling the data quality of any individual sample. Compared with [11], in this paper, each data sample can require a different or variable cost to obtain, depending on the type of data source involved, and the specific sampling action used to obtain that data sample. Moreover, in this paper, the quality of each data sample depends both on the type of data source and on the sampling action used to obtain that data sample. For example, in this paper, we have the freedom of not only optimizing the number of data samples for each data source, but also optimizing the effort (cost) spent on obtaining a particular data sample from a particular data source (depending on the cost-quality tradeoff of that data source); while in [11], one only has the freedom of choosing the number of data samples from each data source. In [11], each data source (type of experiment) reveals information about only one element of the parameter vector $\theta$ ; while in this work, a sample from a data source can possibly reveal information about several elements of the parameter vector $\theta$ .

4 Clean Sensing: Optimal Estimation Using Heterogeneous Data under Cost Constraints

In this section, we introduce the framework of clean sensing, namely optimal estimation using heterogeneous data under cost constraints. As explained in Section 2, we assume that we spend cost $c_{i}^{k}$ on acquiring the $i$-th sample from the $k$-th data source, and, given the costs spent on acquiring each data sample, all the samples are independent of each other. Our goal is to optimally allocate the sensing resources to each data source and each data sample, in order to minimize the Cramér-Rao bound on the mean-squared error of parameter estimation. One can also extend this framework to minimize other types of bounds on the mean-squared error of parameter estimation.

We consider a parameter column vector denoted by

$$\theta = [\theta_{1},\theta_{2},\ldots,\theta_{d}]^{T} \in \mathbb{R}^{d}.$$

Under the parameter vector $\theta$, we assume that the probability density function of an observation sample is given by $f(x;\theta)$. Let $g(X)$ be an estimator of any vector function of the parameters, $g(X)=(g_{1}(X),\ldots,g_{d}(X))^{T}$ (note that we can also consider cases where $g(X)$ is of a different dimension), and denote its expectation vector $E[g(X)]$ by $\phi(\theta)$.

The Fisher information matrix is a $d\times d$ matrix with its element $I_{{i,j}}$ in the $i$ -th row and $j$ -th column defined as

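$$I_{i,j} = E\left[\frac{\partial}{\partial\theta_{i}}\log f(x;\theta)\,\frac{\partial}{\partial\theta_{j}}\log f(x;\theta)\right] = -E\left[\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\log f(x;\theta)\right].$$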
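As a quick sanity check of the Fisher information definition, the expected squared score and the negative expected second derivative of the log-likelihood can be compared on a distribution where both are easy to compute exactly. The sketch below uses a single Bernoulli observation with success probability $\theta$, which is an illustrative choice made here and not an example from the paper; both quantities should equal $1/(\theta(1-\theta))$.

```python
def bernoulli_fisher_two_ways(theta: float) -> tuple[float, float]:
    """Return (E[score^2], -E[second derivative of log f]) for one Bernoulli sample.

    Here f(x; theta) = theta^x * (1 - theta)^(1 - x) with x in {0, 1},
    and the expectations are exact sums over the two outcomes.
    """
    expected_score_sq = 0.0
    neg_expected_hessian = 0.0
    for x in (0, 1):
        prob = theta if x == 1 else 1.0 - theta
        # d/dtheta log f(x; theta)
        score = x / theta - (1 - x) / (1.0 - theta)
        # d^2/dtheta^2 log f(x; theta)
        second = -x / theta**2 - (1 - x) / (1.0 - theta) ** 2
        expected_score_sq += prob * score**2
        neg_expected_hessian += -prob * second
    return expected_score_sq, neg_expected_hessian


theta = 0.3
score_form, hessian_form = bernoulli_fisher_two_ways(theta)
closed_form = 1.0 / (theta * (1.0 - theta))
print(score_form, hessian_form, closed_form)
```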

We note that the Cramér-Rao bound on the estimation error relies on some regularity conditions on the probability density function $f(x;\theta)$ and the estimator $g(X)$. For a scalar parameter $\theta$, the Cramér-Rao bound depends on two weak regularity conditions on the probability density function $f(x;\theta)$ and the estimator $g(X)$. The first condition is that the Fisher information is always defined; namely, for all $x$ such that $f(x;\theta)>0$,

$$\frac{\partial}{\partial\theta}\log f(x;\theta)$$

exists, and is finite. The second condition is that the operations of integration with respect to $x$ and differentiation with respect to $\theta$ can be interchanged in the expectation of $g$ , namely,

[TABLE]

whenever the right-hand side is finite.

For the multivariate parameter vector ${\theta}$ , the Cramér-Rao bound then states that the covariance matrix of ${g}(X)$ satisfies

[TABLE]

where $\frac{\partial{{\phi}}\left({{\theta}}\right)}{\partial{{\theta}}}$ is the Jacobian matrix with the element in the $i$ -th row and $j$ -th column as $\partial\phi_{i}({{\theta}})/\partial\theta_{j}$ .

We define the Fisher information matrix of ${\theta}$ from the $i$-th sample of the $k$-th data source as $F^{k}(c_{i}^{k},\theta)$ (sometimes, when the context is clear, we abbreviate it as $F^{k}(c_{i}^{k})$). Thus, the Fisher information matrix of ${\theta}$ for data from the $k$-th data source is

[TABLE]

since the Fisher information matrices are additive for independent observations of the same parameters. Since we assume that the samples from different data sources are also independently generated, the Fisher information matrix of ${\theta}$ considering all the data sources is given by

[TABLE]

which we abbreviate as $F$ in the following derivations.

Without loss of generality, we consider estimating a scalar function $T(\theta)$ of $\theta$ . To estimate $T(\theta)$ , we can lower bound the variance of the estimate of $T(\theta)$ by

[TABLE]

where $\phi(\theta)=E(g(X))$ is a scalar. If the estimate of $T(\theta)$ is unbiased, namely $\phi(\theta)=T(\theta)$ , we can lower bound the mean-squared error of the estimate of $T(\theta)$ by

[TABLE]
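As a small numerical illustration of the two facts used above, the following Python sketch treats the special case of a diagonal Fisher information matrix (the setting studied in Section 5). It adds the per-sample Fisher information of independent samples and then evaluates the scalar bound $\frac{\partial T(\theta)}{\partial\theta}F^{-1}(\frac{\partial T(\theta)}{\partial\theta})^{T}$ for an unbiased estimator. The function names and all numbers are hypothetical and are chosen only for illustration.

```python
def total_fisher_diag(per_sample_info):
    """Diagonal of F under the diagonal assumption.

    per_sample_info[k] lists the scalar Fisher information about theta_k
    contributed by each independent sample from source k. Fisher information
    is additive over independent samples, so the k-th diagonal entry is a sum.
    """
    return [sum(infos) for infos in per_sample_info]


def crlb_scalar_diag(grad_T, fisher_diag):
    """Evaluate grad_T * F^{-1} * grad_T^T when F is diagonal."""
    if any(f <= 0 for f in fisher_diag):
        raise ValueError("F must be invertible (all diagonal entries positive)")
    return sum(g * g / f for g, f in zip(grad_T, fisher_diag))


# Hypothetical example: two sources and T(theta) = 0.5*theta_1 + 0.5*theta_2.
per_sample_info = [[4.0, 4.0, 4.0], [2.0, 2.0]]
F_diag = total_fisher_diag(per_sample_info)      # [12.0, 4.0]
bound = crlb_scalar_diag([0.5, 0.5], F_diag)     # 0.25/12 + 0.25/4
print(F_diag, bound)
```

In this toy example the second source contributes most of the bound because it carries less total Fisher information, which is the kind of imbalance the allocation problem below is meant to correct.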

The goal of clean sensing is to minimize the error of parameter estimation from heterogeneous data sources. Suppose we require the estimator of a function $T(\theta)$ of the parameter $\theta$ to be unbiased. Then one way to minimize the mean-squared error of the estimation is to design sensing strategies which minimize the Cramér-Rao lower bound on the estimate. Mathematically, we are trying to solve the following optimization problem:

[TABLE]

where $C$ is the total budget, and $\mathbb{Z}_{\geq 0}$ is the set of nonnegative integers.

We notice that this optimization depends on knowledge of the parameter vector $\theta$ . However, before sampling begins, we have limited knowledge of the parameters. Depending on the goals of estimation, we can change the objective function of the optimization problem (5) using the limited knowledge of the parameters. For example, if we know in advance a prior distribution $h(\theta)$ of $\theta$ , we would like to minimize the expectation of the Cramér-Rao lower bound for an unbiased estimator over the prior distribution. Mathematically, we formulate the corresponding optimization problem as

[TABLE]

where $C$ is the total budget. If we instead know in advance the parameter to be estimated belongs to a set $\Omega$ , we can also try to minimize the worst-case Cramér-Rao lower bound for an unbiased estimator over $\Omega$ . We can thus write the minimax optimization problem as

[TABLE]

When the Cramér-Rao lower bounds

[TABLE]

are the same for all the possible values of $\theta$ ’s under every possible choice of $c_{i}^{k}$ ’s, then the optimization problems (5), (15), and (10) all reduce to

[TABLE]

5 Optimal Sensing Strategy for Independent Heterogeneous Data Sources with Diagonal Fisher Information Matrices

In this section, we investigate the optimal sensing strategy for independent heterogeneous data sources with diagonal Fisher information matrices. We assume that, under every possible action $\mathcal{A}$,

[TABLE]

One can show that, under this assumption, the Fisher information matrix

[TABLE]

is a diagonal matrix, where $F^{k}(c_{i}^{k},\theta)$ is a $d\times d$ Fisher information matrix based on observation $X_{i}^{k}$ . Moreover, for every $k$ and $i$ , $F^{k}(c_{i}^{k},\theta)$ is a diagonal matrix, for which only the $k$ -th element of the diagonal, denoted by $F^{k}(c_{i}^{k},\theta_{k})$ , can possibly be nonzero.

Theorem 5.1.

Let us consider estimating a function $T(\theta)$ of an unknown parameter vector $\theta$ of dimension $d$, using data from $K$ independent heterogeneous data sources, where $K$ is a positive integer. From the $k$-th data source, we obtain $m_{k}$ samples, where $1\leq k\leq K$. We denote the $m_{k}$ samples from the $k$-th data source as $X_{1}^{k}$, $X_{2}^{k}$, …, and $X_{m_{k}}^{k}$. We assume that cost $c^{k}_{i}$ is spent on acquiring the $i$-th sample from the $k$-th data source, where $1\leq k\leq K$ and $1\leq i\leq m_{k}$. We further assume that $X_{1}^{k}$, $X_{2}^{k}$, …, and $X_{m_{k}}^{k}$ are mutually independent.

We let $F^{k}(c_{i}^{k},\theta)$ be a $d\times d$ Fisher information matrix based on observation $X_{i}^{k}$ , as a function of $\theta$ and cost $c_{i}^{k}$ . We assume that $X_{i}^{k}$ only reveals information about $\theta_{k}$ ; namely, we assume that, for every $k$ and $i$ , $F^{k}(c_{i}^{k},\theta)$ is a diagonal matrix, for which only the $k$ -th element of the diagonal, denoted by $F^{k}(c_{i}^{k},\theta_{k})$ as a function of $c_{i}^{k}$ and $\theta_{k}$ , can possibly be nonzero.

We assume, under a cost $c$ , the function $F^{k}(c,\theta_{k})$ satisfies the following conditions:

1. $F^{k}(c,\theta_{k})$ is a non-decreasing function in $c$ for $c\geq 0$;

2. $\frac{c}{F^{k}(c,\theta_{k})}$ is well defined for $c\geq 0$;

3. $\frac{c}{F^{k}(c,\theta_{k})}$ achieves its minimum value at a finite $c=c_{k}^{**}\geq 0$, and the corresponding minimum value $\frac{c_{k}^{**}}{F^{k}(c_{k}^{**},\theta_{k})}$ is denoted by $p_{k}^{*}$.

Let $g(X)$ be an unbiased estimator of a function $T(\theta)$ of parameter vector $\theta$ . Let $C$ be the total allowable budget for acquiring samples from the $K$ data sources. When $C\rightarrow\infty$ , the smallest possible achievable Cramér-Rao lower bound $\frac{\partial T(\theta)}{\partial\theta}F^{-1}(\theta)(\frac{\partial T(\theta)}{\partial\theta})^{T}$ on the mean-squared error of $g(X)$ is given by

[TABLE]

Moreover, when $C\rightarrow\infty$, for the optimal sensing strategy that achieves this smallest possible Cramér-Rao lower bound on the mean-squared error, the optimal cost allocated for the $k$-th data source, denoted by $C_{k}^{*}$, satisfies:

[TABLE]

The optimal cost $(c_{i}^{k})^{*}$ associated with obtaining the $i$ -th sample of the $k$ -th source is given by

[TABLE]

The number of samples obtained from the $k$ -th data source satisfies

[TABLE]

Proof.

We assume that budget $C_{k}$ is allocated to obtain $m_{k}$ samples from the $k$ -th data source, namely

[TABLE]

Then we can conclude that

[TABLE]

where in (23) we use the fact that $\frac{c}{F^{k}(c,\theta_{k})}$ achieves its minimum value at a finite $c=c_{k}^{**}\geq 0$, and the corresponding minimum value $\frac{c_{k}^{**}}{F^{k}(c_{k}^{**},\theta_{k})}$ is denoted by $p_{k}^{*}$.

Moreover, we claim that there exists a strategy of allocating budget $C_{k}$ to samples in such a way that

[TABLE]

In fact, one can take

[TABLE]

samples, and spend $c_{k}^{**}$ to obtain each of the samples except for the last sample, on which we spend $\left(\frac{C_{k}}{c_{k}^{**}}-\lfloor\frac{C_{k}}{c_{k}^{**}}\rfloor\right)\times c_{k}^{**}$. Then we have

[TABLE]

In summary, we have

[TABLE]

Then the smallest possible achievable Cramér-Rao lower bound $\frac{\partial T(\theta)}{\partial\theta}F^{-1}(\theta)(\frac{\partial T(\theta)}{\partial\theta})^{T}$ , as defined by the optimal objective function of the following optimization problem,

[TABLE]

can be lower bounded by the optimal value of the following optimization problem:

[TABLE]

We can solve (41) through its Lagrange dual problem and the Karush-Kuhn-Tucker conditions (please refer to Appendix 10.1). For the optimal solution, we have obtained that

[TABLE]

and, moreover, under the optimal $C_{k}^{*}$ , the optimal value of (41) is given by

[TABLE]

Now let us consider upper bounding the smallest possible achievable Cramér-Rao lower bound $\frac{\partial T(\theta)}{\partial\theta}F^{-1}(\theta)(\frac{\partial T(\theta)}{\partial\theta})^{T}$ , as defined by the following optimization problem,

[TABLE]

We note that, because

[TABLE]

the optimal objective value of (46) can be upper bounded by the optimal objective value of the following optimization problem (47):

[TABLE]

We further notice that the optimal objective value of (47) is upper bounded by the objective value of the following optimization problem (48):

[TABLE]

One can solve (48) (please refer to Appendix 10.2), and obtain that

[TABLE]

Moreover, the optimal value of (48) is given by

[TABLE]

Under this strategy, the number of samples $m_{k}$ satisfies

[TABLE]

∎

As can be seen from the conclusions of Theorem 5.1, the cost allocated for sensing from the $k$-th data source is not the same for each data source: it is proportional to how important $\theta_{k}$ is to the estimation goal (namely $\frac{\partial T(\theta)}{\partial\theta_{k}}$), and is also proportional to the “inverse” data quality of the $k$-th source (namely the square root of $p_{k}^{*}$, which is the optimal “cost-over-Fisher-information-ratio” for the $k$-th data source). We note that the higher $p_{k}^{*}$ is, the worse the data quality is, and the harder it is to obtain a certain amount of Fisher information from the $k$-th data source. Asymptotically, our results show that the cost spent on obtaining each data sample from the $k$-th data source should be $c_{k}^{**}$, namely the cost at which the best “Fisher-information-over-cost-ratio” is achieved. This is in contrast to traditional practices of using the same cost for each sample from each data source. We also observe that, asymptotically, the Fisher information provided by the $k$-th data source is given by

[TABLE]

which implies that the Fisher information eventually provided by the $k$-th data source should be proportional to the square root of its best “Fisher-information-over-cost-ratio”. This means that the optimal sensing strategy should let the sources with better data quality eventually provide more Fisher information (assuming the $K$ parameters are of the same level of importance to the estimation objective). The number of samples $m_{k}$ from the $k$-th data source should be given by

[TABLE]

This means the number of samples from a data source is inversely proportional to the square root of the “Fisher-information-cost-product” at the best individual sample cost for that source.

In summary (assuming that the $K$ parameters are equally important to the estimation objective), the higher the data quality of the $k$-th data source, the less total budget should be allocated for sampling from the $k$-th source, and the more Fisher information the $k$-th data source will eventually provide. To further understand the implications of our results, let us consider the special case $F^{1}(c_{1}^{**},\theta_{1})=F^{2}(c_{2}^{**},\theta_{2})=\cdots=F^{K}(c_{K}^{**},\theta_{K})$ and $\frac{\partial T(\theta)}{\partial\theta_{1}}=\frac{\partial T(\theta)}{\partial\theta_{2}}=\cdots=\frac{\partial T(\theta)}{\partial\theta_{K}}$. Then the total sensing budget allocated for the $k$-th data source should be proportional to $\sqrt{c_{k}^{**}}$, the number of samples from the $k$-th data source should be proportional to $\frac{1}{\sqrt{c_{k}^{**}}}$, and the cost spent on each sample from the $k$-th data source should be proportional to (equal to) $c_{k}^{**}$. This implies that, in this special case, the worse the data quality of the $k$-th data source (namely, for the same Fisher information from an individual sample, a higher cost needs to be spent on a sample), the more sensing budget should be allocated for the $k$-th data source and the higher the cost that should be spent on obtaining an individual sample from it, but the smaller the number of samples that should be taken from it.
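The qualitative conclusions above can be checked numerically. The following Python sketch assumes the asymptotic allocation described in the discussion: the budget of source $k$ is proportional to $|\partial T(\theta)/\partial\theta_{k}|\sqrt{p_{k}^{*}}$, every sample costs $c_{k}^{**}$, and source $k$ then provides Fisher information $C_{k}/p_{k}^{*}$. The function name `asymptotic_allocation` and the example numbers are hypothetical, and rounding of the sample counts to integers is ignored because the statement is asymptotic in $C$.

```python
import math


def asymptotic_allocation(grad_T, p_star, c_star, budget):
    """Split the budget with C_k proportional to |dT/dtheta_k| * sqrt(p_k^*).

    grad_T[k] : partial derivative of T with respect to theta_k
    p_star[k] : best cost-over-Fisher-information ratio of source k (> 0)
    c_star[k] : per-sample cost at which p_star[k] is attained (> 0)
    budget    : total budget C, assumed large
    """
    weights = [abs(g) * math.sqrt(p) for g, p in zip(grad_T, p_star)]
    total = sum(weights)
    C_k = [budget * w / total for w in weights]
    m_k = [C / c for C, c in zip(C_k, c_star)]        # samples per source
    fisher_k = [C / p for C, p in zip(C_k, p_star)]   # information per source
    bound = sum(g * g / f for g, f in zip(grad_T, fisher_k))
    return C_k, m_k, fisher_k, bound


# Special case from the text: equal per-sample Fisher information (here 1.0)
# and equal gradients, but source 2 needs four times the cost per sample.
C_k, m_k, fisher_k, bound = asymptotic_allocation(
    grad_T=[1.0, 1.0], p_star=[1.0, 4.0], c_star=[1.0, 4.0], budget=300.0
)
print(C_k, m_k, fisher_k, bound)
```

With these numbers the costlier source receives twice the budget but only half as many samples, matching the $\sqrt{c_{k}^{**}}$ and $1/\sqrt{c_{k}^{**}}$ scalings stated above. The resulting bound equals $(\sum_{k}|\partial T/\partial\theta_{k}|\sqrt{p_{k}^{*}})^{2}/C$ for this allocation.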

6 Clean Sensing for Optimal Parameter Estimation of Gaussian Random Variables

In the last section, we derived the optimal sensing strategy for independent random variables from heterogeneous data sources. As discussed above, we can extend the framework of clean sensing to estimate parameters from dependent random variables. In this section, we derive the optimal sensing strategies to estimate the parameters related to the mean values of multivariate Gaussian random variables, which are not necessarily independent. In the last section, we mostly considered the case where data from the $k$-th data source are only concerned with the $k$-th parameter $\theta_{k}$. In this section, we extend the clean sensing framework to include cases where data from the $k$-th data source may provide information about more than one parameter, and sometimes even about all the parameters. Moreover, for the case of multivariate Gaussian random variables, we have closed-form expressions for the related Fisher information matrix, and the examples in this section illustrate how the clean sensing framework can be applied to signal processing problems of parameter estimation for Gaussian random variables.

We consider $K$ Gaussian random variables $X={\begin{bmatrix}X_{1},\dots,X_{K}\end{bmatrix}}^{\mathrm{T}}$ . We assume that the mean values of these random variables are ${\displaystyle\,\mu(\theta)={\begin{bmatrix}\mu_{1}(\theta),\dots,\mu_{K}(\theta)\end{bmatrix}}^{\mathrm{T}}}$ , where $\theta$ is a $K$ -dimensional vector (we can also consider vectors of other dimensions, but we choose $K$ to simplify the analysis). We let ${\displaystyle\,\Sigma(\theta)}$ be its covariance matrix. Then, for $1\leq m,n\leq K$ , the element in the $m$ -th row and the $n$ -th column of the Fisher information matrix (with respect to parameter $\theta$ ) is given by [8]:

[TABLE]

where $(\cdot)^{\mathrm{T}}$ denotes the transpose of a vector, $\operatorname{tr}(\cdot)$ denotes the trace of a square matrix, and

[TABLE]

Note that a special case is the one where ${\displaystyle\Sigma(\theta)=\Sigma}$ is a constant matrix. When $\Sigma(\theta)$ is a constant matrix, we have

[TABLE]

Namely, the Fisher information matrix $F$ can be written as

[TABLE]

where the $K\times K$ matrix $\frac{\partial\mu}{\partial\theta}$ is defined as

[TABLE]

Let $g(X)$ be an unbiased estimator of a function $T(\theta)$ of parameter vector $\theta$ . Assuming $F$ is invertible, we can lower bound the mean-squared error of the estimate $g(X)$ of $T(\theta)$ by

[TABLE]

We denote $v^{T}=\frac{\partial T(\theta)}{\partial\theta}^{T}\left({\frac{\partial\mu}{\partial\theta}}\right)^{-1}$. Then the lower bound on the mean-squared error of the estimate $g(X)$ of $T(\theta)$ is given by

[TABLE]

6.1 Optimal Strategy for One-Time Sampling of Gaussian Random Variables

Suppose that we consider the case where we take one sample from each of the $K$ data sources. We assume that if we spend cost $C_{i}$ on sampling data source $i$ , where $1\leq i\leq K$ , the corresponding covariance matrix $\Sigma$ is given by

[TABLE]

where $v_{i}$ ’s are known vectors. Then the lower bound of the mean-squared error of the estimate of $T(\theta)$ is given by

[TABLE]

Then to minimize the mean-squared error of the estimation under the total cost constraint, we need to solve the following optimization problem:

[TABLE]

One can obtain the solution to (60) as

[TABLE]

and the corresponding smallest lower bound is given by

[TABLE]

where $v^{T}=\frac{\partial T(\theta)}{\partial\theta}^{T}\left({\frac{\partial\mu}{\partial\theta}}\right)^{-1}.$

6.2 Optimal Strategy for Multiple-Time Samplings of Gaussian Random Variables

In this subsection, we assume that there are $K$ data sources, and from each data source, we take $m_{k}$ samples. We assume that the $i$ -th sample from the $k$ -th data source is given by

[TABLE]

where $1\leq i\leq m_{k}$ , and $w_{i}^{k}$ is a zero-mean Gaussian random variable with variance $\sigma^{2}_{k,i}$ . We assume that the random variables $w_{i}^{k}$ ’s are independent from each other. We further assume that we spend cost $c_{i}^{k}$ in obtaining the $i$ -th sample from the $k$ -th data source, and the variance is given by

[TABLE]

where $\sigma^{2}$ is a constant, and $f_{k}(c_{i}^{k})$ is a non-decreasing function of $c_{i}^{k}$ .

From the discussions above, the Fisher information matrix $F$ with respect to $\theta$ is given by

[TABLE]

where the $K\times K$ matrix $\frac{\partial\mu}{\partial\theta}$ is defined as

[TABLE]

and $\Sigma$ is an $m\times m$ diagonal matrix defined as follows:

[TABLE]

with $m=\sum_{k=1}^{K}m_{k}$ .

Then we can further express the Fisher information matrix $F$ as

[TABLE]

where

[TABLE]

We assume that $\mu_{i}^{k}$ = $\mu^{k}$ for every $i$ such that $1\leq i\leq m_{k}$ . Then

[TABLE]

where the $K\times K$ matrix $\frac{\partial\mu}{\partial\theta}$ is defined as in (49) and

[TABLE]

We denote $v^{T}=\frac{\partial T(\theta)}{\partial\theta}^{T}\left({\frac{\partial\mu}{\partial\theta}}\right)^{-1}$. Then the Cramér-Rao lower bound on the mean-squared error of the estimate of $T(\theta)$ is given by

[TABLE]

Then to minimize the mean-squared error of the estimation under the total cost constraint, we need to solve the following optimization problem:

[TABLE]

As one example, suppose that $f_{k}(x)=\alpha_{k}^{2}x$ , where $\alpha_{k}$ is a constant, then one can obtain the solution to (72) as

[TABLE]

and the corresponding smallest possible Cramér-Rao lower bound is given by

[TABLE]

As another example, let us consider the general case where we assume that

[TABLE]

where $c=c_{k}^{*}$ satisfies $\frac{f_{k}(c^{*}_{k})}{c^{*}_{k}}=(\alpha_{k}^{*})^{2}$ .

Under this assumption, one can show that asymptotically, as $C\rightarrow\infty$, the corresponding smallest possible Cramér-Rao lower bound satisfies

[TABLE]

and

[TABLE]

We remark that when $T(\theta)$ is a linear function of $\theta$, and $\mu_{k}(\theta)$ is a linear function of $\theta$ for every $1\leq k\leq K$, the optimal sampling strategy is universal for every possible $\theta\in\mathcal{R}^{K}$. Namely, the optimizer does not need prior knowledge of the domain of $\theta$ in order to optimize the sampling strategy. We also remark that, when $f_{k}(x)=\alpha_{k}^{2}x$, the number of samples for each data source under the optimal sensing strategy can be any positive number; however, when $f_{k}(x)$ is a general nonnegative increasing function of $x$, the optimal sensing strategy also needs to determine the optimal number $m_{k}$ of samples for each data source $k$.

7 Clean Sensing for Accurate Election Opinion Polling: Optimal Strategies under Distorted Data

In this section, we apply the clean sensing framework to the problem of opinion polling for a (political) election, and explain one possible reason for the inaccuracies of the polling for the 2016 presidential election through this framework. We first introduce our mathematical model for opinion polling.

7.1 Mathematical Modeling of Election Opinion Polling from Heterogeneous Demographic Groups

In an election, we assume for simplicity that there are two candidates for the targeted position: candidate $A$ and candidate $B$. We remark, however, that our analysis can be easily extended to include more than two candidates. While our analysis can also be extended to incorporate undecided voters or non-voting citizens, for simplicity we assume that every citizen will vote, and that every citizen will vote either for Candidate $A$ or for Candidate $B$. We also assume that each citizen has made up his/her mind about which candidate he/she will vote for at the time of opinion polling.

We consider $K$ demographic groups, and assume that a fraction $\theta_{k}$ of the people from the $k$-th demographic group eventually vote for candidate $A$, where $1\leq k\leq K$ and $0\leq\theta_{k}\leq 1$. We assume that the population of each demographic group is large enough that a person randomly polled from the $k$-th demographic group eventually votes for candidate $A$ with probability $\theta_{k}$. Moreover, we assume that the eventual voting decisions of polled persons are independent of each other within a demographic group, and across different demographic groups. We assume that the pollster randomly selects, without replacement, $m_{k}$ people from the $k$-th group to ask for their opinions. We use the random variable $Z_{i}^{k}$ to represent how the $i$-th ($1\leq i\leq m_{k}$) person polled from demographic group $k$ ($1\leq k\leq K$) will eventually vote in the actual election: if the polled person votes for Candidate $A$, then $Z_{i}^{k}=1$; otherwise, $Z_{i}^{k}=0$. As discussed above, we assume that the $Z_{i}^{k}$ are independent random variables, within a demographic group and across different demographic groups.

Moreover, when we sample the opinion of a randomly selected person from the $k$-th demographic group, we assume that the person will always give a response indicating whether he/she will vote for Candidate $A$ or Candidate $B$. We use the random variable $X_{i}^{k}$ to represent the test response of the $i$-th polled person from the $k$-th demographic group: if the test result indicates that the $i$-th person from the $k$-th group will vote for Candidate $A$, then $X_{i}^{k}=1$; otherwise, $X_{i}^{k}=0$. Suppose that we spend cost $c_{i}^{k}$ on testing the response of the $i$-th polled person from the $k$-th demographic group. For example, the pollster can take the low-cost path of simply asking that person for his/her opinion through a phone call, or can take the high-cost path of examining both his/her phone call response and his/her social media posts and other relevant information.

For each $k$, we let $\beta_{k}(c_{i}^{k})$ and $\gamma_{k}(c_{i}^{k})$ be two functions of the cost $c_{i}^{k}$, and assume that they take values between $0$ and $1$. We assume that, conditioning on $Z_{i}^{k}=1$, $X_{i}^{k}=1$ with probability $1-\beta_{k}(c_{i}^{k})$ and $X_{i}^{k}=0$ with probability $\beta_{k}(c_{i}^{k})$, where $0\leq\beta_{k}(c_{i}^{k})\leq 1$; and that, conditioning on $Z_{i}^{k}=0$, $X_{i}^{k}=1$ with probability $\gamma_{k}(c_{i}^{k})$ and $X_{i}^{k}=0$ with probability $1-\gamma_{k}(c_{i}^{k})$, where $0\leq\gamma_{k}(c_{i}^{k})\leq 1$. Namely, if a polled person eventually votes for Candidate $A$, that person provides a different opinion when polled (tested) with probability $\beta_{k}(c_{i}^{k})$; and if a polled person eventually votes for Candidate $B$, that person provides a different opinion when polled (tested) with probability $\gamma_{k}(c_{i}^{k})$.

Moreover, we assume that

[TABLE]

Namely $X_{i}^{k}$ ’s are obtained from $Z_{i}^{k}$ ’s through a discrete memoryless channel in the language of information theory; or, in English, for a certain $i$ and $k$ , $X_{i}^{k}$ only depends on $Z_{i}^{k}$ . Thus $X_{i}^{k}=1$ with probability

[TABLE]

and $X_{i}^{k}=0$ with probability

[TABLE]

where we abbreviate $\beta_{k}(c_{i}^{k})$ and $\gamma_{k}(c_{i}^{k})$ to $\beta_{k}$ and $\gamma_{k}$ respectively.
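To make the channel model concrete, the following Python sketch computes the response probability by the law of total probability, $P(X_{i}^{k}=1)=\theta_{k}(1-\beta_{k})+(1-\theta_{k})\gamma_{k}$, and compares it with a simulation of the two-stage model $Z_{i}^{k}\rightarrow X_{i}^{k}$. The function names, the parameter values, and the fixed random seed are illustrative assumptions and do not come from the paper.

```python
import random


def response_prob(theta, beta, gamma):
    """P(X = 1) when Z ~ Bernoulli(theta) passes through the channel.

    A vote for A (Z = 1) is misreported with probability beta,
    a vote for B (Z = 0) is misreported with probability gamma.
    """
    return theta * (1.0 - beta) + (1.0 - theta) * gamma


def simulate_responses(theta, beta, gamma, m, rng):
    """Draw m independent poll responses X_i from the model Z_i -> X_i."""
    xs = []
    for _ in range(m):
        z = 1 if rng.random() < theta else 0           # eventual vote
        flip_prob = beta if z == 1 else gamma          # distortion probability
        x = 1 - z if rng.random() < flip_prob else z   # reported opinion
        xs.append(x)
    return xs


rng = random.Random(0)  # fixed seed, used only to make the run repeatable
xs = simulate_responses(theta=0.5, beta=0.3, gamma=0.0, m=100_000, rng=rng)
print(sum(xs) / len(xs), response_prob(0.5, 0.3, 0.0))
```

The empirical frequency of $X_{i}^{k}=1$ should be close to the computed probability, which differs from $\theta_{k}$ whenever the distortion is asymmetric.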

The goal of the estimator is to estimate

[TABLE]

from the polled data $X_{i}^{k}$’s, where we assume that the $k$-th demographic group constitutes a fraction $\alpha_{k}$ of the total voter population. Then the Cramér-Rao bound for estimating $\theta$ is given by

[TABLE]

where $V_{k}(\theta_{k})$ is the Fisher information for $\theta_{k}$ , and $1/V_{k}(\theta_{k})$ is the Cramér-Rao bound for estimating $\theta_{k}$ using the data from the $k$ -th demographic group.

Then the Fisher information of $\theta_{k}$ provided by the $i$ -th sample of the $k$ -th demographic group is

[TABLE]

Then the Fisher information of $\theta_{k}$ provided by the $k$ -th demographic group is given by

[TABLE]

We note that, when $\beta_{k}=0$ and $\gamma_{k}=0$, $I_{i}^{k}=\frac{1}{\theta_{k}(1-\theta_{k})}$. The following lemma says that the Fisher information achieves its largest value when $\beta_{k}=0$ and $\gamma_{k}=0$.

Lemma 7.1.

$I_{i}^{k}(\beta_{k},\gamma_{k})\leq\frac{1}{\theta_{k}(1-\theta_{k})},$ * where $0\leq\beta_{k}\leq 1$ and $0\leq\gamma_{k}\leq 1$ .*

Proof.

The claim is obvious when $\beta_{k}+\gamma_{k}=0$ or $\beta_{k}+\gamma_{k}=2$ . When $\beta_{k}+\gamma_{k}=1$ , then $I_{i}^{k}=0\leq\frac{1}{\theta_{k}(1-\theta_{k})}$ . Now let us consider the case where $0<\beta_{k}+\gamma_{k}<1$ . Then we have

[TABLE]

where in the last step we use the fact that $\frac{\gamma_{k}}{1-\beta_{k}-\gamma_{k}}$ is nonnegative, and $\frac{1-\gamma_{k}}{1-\beta_{k}-\gamma_{k}}\geq 1$ .

We further consider the case $1<\beta_{k}+\gamma_{k}<2$ . We first do a change of variable $\beta^{\prime}=1-\beta_{k}$ and $\gamma^{\prime}=1-\gamma_{k}$ . We thus have $0\leq\beta^{\prime}\leq 1$ , $0\leq\gamma^{\prime}\leq 1$ and $0<\beta^{\prime}+\gamma^{\prime}<1$ . We will show that $I_{i}^{k}(\beta_{k},\gamma_{k})=I_{i}^{k}(\beta^{\prime},\gamma^{\prime})$ , implying that $I_{i}^{k}(\beta_{k},\gamma_{k})\leq\frac{1}{\theta_{k}(1-\theta_{k})}$ when $1<\beta_{k}+\gamma_{k}<2$ . In fact

[TABLE]

This concludes the proof of this lemma. ∎
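Lemma 7.1 can also be checked numerically. The sketch below assumes that a single response is a Bernoulli variable with success probability $q=\gamma_{k}+(1-\beta_{k}-\gamma_{k})\theta_{k}$, so that its Fisher information about $\theta_{k}$ is $(1-\beta_{k}-\gamma_{k})^{2}/(q(1-q))$. This expression reproduces the special cases quoted in the text ($\beta_{k}=\gamma_{k}=0$ and $\beta_{k}+\gamma_{k}=1$). The function names and the grid resolution are hypothetical choices.

```python
def fisher_info_distorted(theta, beta, gamma):
    """Fisher information about theta in one response X ~ Bernoulli(q),
    where q = gamma + (1 - beta - gamma) * theta and 0 < theta < 1."""
    if not 0.0 < theta < 1.0:
        raise ValueError("this sketch assumes 0 < theta < 1")
    slope = 1.0 - beta - gamma          # dq / dtheta
    q = gamma + slope * theta
    denom = q * (1.0 - q)
    if denom <= 0.0:
        # For 0 < theta < 1 this occurs only at (beta, gamma) = (1, 0) or
        # (0, 1), where the slope is zero and X carries no information.
        return 0.0
    return slope * slope / denom


def lemma_holds_on_grid(theta, steps=20, tol=1e-9):
    """Check I(beta, gamma) <= 1 / (theta * (1 - theta)) on a grid."""
    clean = 1.0 / (theta * (1.0 - theta))
    for i in range(steps + 1):
        for j in range(steps + 1):
            info = fisher_info_distorted(theta, i / steps, j / steps)
            if info > clean + tol:
                return False
    return True


print(all(lemma_holds_on_grid(t / 10) for t in range(1, 10)))
```

The same function exhibits the symmetry used in the proof: replacing $(\beta_{k},\gamma_{k})$ by $(1-\beta_{k},1-\gamma_{k})$ maps $q$ to $1-q$ and leaves the Fisher information unchanged.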

7.2 Optimal Polling Strategy for a Particular $(\theta_{1},...,\theta_{K})$

In this subsection, we investigate finding the optimal polling strategy which minimizes the Cramér-Rao bound of the mean-squared error of parameter estimation of $\theta$ , for a particular parameter set $(\theta_{1},...,\theta_{K})$ . One can also extend this work to minimize the worst-case mean-squared error (minimax MSE) if the parameter vector is known to be within a certain set or to minimize the average-case mean-squared error if we have a prior knowledge of the distribution of the parameter vectors, as discussed above. In this subsection, we only illustrate the results for a particular parameter set $(\theta_{1},...,\theta_{K})$ , which is useful when the pollster knows that the parameter vector is within a vicinity of that parameter vector.

We would like to decide the optimal number of polled people $m_{k}$ for each demographic group, and the cost $c_{i}^{k}$ spent on polling the $i$-th person from the $k$-th demographic group. The goal is to minimize the Cramér-Rao bound on the mean-squared error of parameter estimation of $\theta$, under a total budget constraint $C$ for polling all the $K$ demographic groups. We thus have the following optimization problem formulation:

[TABLE]

We assume for every $\theta$ , under a cost $c_{i}^{k}$ , $I_{i}^{k}(c_{i}^{k})$ satisfies the following conditions:

• $I_{i}^{k}(c_{i}^{k})$ is a non-decreasing function in $c_{i}^{k}$ for $c_{i}^{k}\geq 0$;

• $\frac{c_{i}^{k}}{I_{i}^{k}(c_{i}^{k})}$ is well defined for $c_{i}^{k}\geq 0$;

• $\frac{c_{i}^{k}}{I_{i}^{k}(c_{i}^{k})}$ achieves its minimum value at a finite $c_{i}^{k}=c_{k}^{**}\geq 0$;

• we denote the corresponding minimum value $\frac{c_{k}^{**}}{I_{i}^{k}(c_{k}^{**})}$ as $p_{k}^{*}$.
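The quantities $c_{k}^{**}$ and $p_{k}^{*}$ in the conditions above can be found numerically once the distortion curves are specified. The following Python sketch uses a purely hypothetical curve, $\beta_{k}(c)=\gamma_{k}(c)=0.5e^{-c}$, for which a free sample carries no information and the distortion vanishes as the cost grows. It assumes the single-sample Fisher information $(1-\beta_{k}-\gamma_{k})^{2}/(q(1-q))$ with $q=\gamma_{k}+(1-\beta_{k}-\gamma_{k})\theta_{k}$, and locates the minimizer of the cost-over-Fisher-information ratio by a simple grid search. All names and numbers are assumptions made for illustration.

```python
import math


def fisher_info(theta, beta, gamma):
    """Single-sample Fisher information about theta under distortion."""
    slope = 1.0 - beta - gamma
    q = gamma + slope * theta
    denom = q * (1.0 - q)
    return 0.0 if denom <= 0.0 else slope * slope / denom


def best_cost(theta, beta_fn, gamma_fn, c_max=10.0, steps=10_000):
    """Grid search for the cost minimizing c / I(c) on (0, c_max]."""
    best_c, best_ratio = None, math.inf
    for n in range(1, steps + 1):       # c = 0 is skipped in this sketch
        c = c_max * n / steps
        info = fisher_info(theta, beta_fn(c), gamma_fn(c))
        if info > 0.0 and c / info < best_ratio:
            best_c, best_ratio = c, c / info
    return best_c, best_ratio


def distortion(c):
    """Hypothetical distortion curve shared by beta and gamma."""
    return 0.5 * math.exp(-c)


c_star, p_star = best_cost(0.5, distortion, distortion)
print(c_star, p_star)
```

For this particular curve and $\theta_{k}=0.5$ the ratio is $c/(4(1-e^{-c})^{2})$, so the minimizer solves $e^{c}-1=2c$. Spending much less per sample yields responses that are almost pure noise, while spending much more buys little additional information.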

We can now see that the clean sensing framework introduced in Section 4 applies. In particular, specializing Theorem 5.1 to the polling problem, we have the following theorem:

Theorem 7.2.

When $C\rightarrow\infty$, the smallest possible lower bound $\sum_{k=1}^{K}(\alpha_{k})^{2}/V_{k}(\theta_{k})$ on the mean-squared error of an unbiased estimator of $\theta$ is given by

[TABLE]

Moreover, when $C\rightarrow\infty$ , the optimal sensing strategy that achieves this smallest possible lower bound is given by

[TABLE]

and

[TABLE]

As we can see, for the most accurate polling in terms of minimizing the Cramér-Rao bound, the total cost allocated for a particular demographic group should be proportional to the population of that group, and proportional to the square root of the best “cost-over-Fisher-information” ratio for that group. Namely, if the cost associated with obtaining a certain amount of Fisher information from a person in a certain demographic group is high (which often implies that polling data from that group are more distorted or more noisy), the total cost allocated for that group should also be high. Because the number of persons polled from the $k$-th group is given by

[TABLE]

the number of polled persons from the $k$ -th demographic group should be inversely proportional to the square root of $I_{i}^{k}(c_{k}^{**})c_{k}^{**}$ , namely the “Fisher-information-cost-product” (at the best individual cost $c_{k}^{**}$ ) for the $k$ -th group. This is in contrast to the common belief that the number of persons polled from a particular group is only proportional to the group’s population.

7.3 Comparisons with Polling using Plain Averaging of Polling Responses

In this subsection, we will demonstrate that, if clean sensing or other similar mechanisms are not used in guaranteeing the quality of data samples in election polling, the polling results can be quite inaccurate. We show that, when the qualities of individual data samples are not controlled to be good enough, more (big) data may not always be able to drive the polling error down to be small.

In particular, in this subsection, we investigate the polling error when plain averaging is used in estimating the parameter $\theta$ . In the plain averaging strategy, the estimation $\hat{\theta}$ of parameter $\theta$ is given by

[TABLE]

where

[TABLE]

Then the mean-squared error of this estimation is given by

[TABLE]

If the cost spent on obtaining each sample from the $k$-th data source is always equal to $c_{k}$, then $\beta_{k}(c_{k})$ and $\gamma_{k}(c_{k})$ are the same for every sample from the $k$-th data source, and we denote them by $\beta_{k}$ and $\gamma_{k}$. For such $\beta_{k}$’s and $\gamma_{k}$’s, we have

[TABLE]

If $\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})\neq 0$ , then the expected estimation error of $\theta$ will not go down to 0, even if the number of samples $m_{k}\rightarrow\infty$ for every $k$ . For example, let us consider two demographic groups, where $\alpha_{1}=0.5$ , $\theta_{1}=0.5$ , $\alpha_{2}=0.5$ , and $\theta_{2}=0.5$ . For the first group, $\beta_{1}=0.3$ , and $\gamma_{1}=0$ ; and for the 2nd group, $\beta_{2}=0.1$ , and $\gamma_{2}=0$ . Then

[TABLE]

In fact, we can show that, under fixed $\beta_{k}$ and $\gamma_{k}$ , as the number of samples $m_{k}\rightarrow\infty$ for every $k$ , $\theta-\hat{\theta}$ will converge to $-\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})$ almost surely. For the example discussed above, this means that $\theta-\hat{\theta}$ will converge to $-\left(0.5(0-0.3\times 0.5)+0.5(0-0.1\times 0.5)\right)=0.1$ . We can see that when $|\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})|$ is big, the estimation error using plain averaging is also big, leading to inaccurate polling results even if the number of polled people is large.
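The two-group example can be reproduced with a few lines of Python. The sketch assumes that plain averaging weights the per-group sample means by $\alpha_{k}$, so that the expectation of $\hat{\theta}$ differs from $\theta$ by $\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})$. The function name `asymptotic_bias` is hypothetical.

```python
def asymptotic_bias(alpha, theta, beta, gamma):
    """Limit of theta_hat - theta under plain averaging as every m_k grows.

    Each group mean converges to gamma_k + (1 - beta_k - gamma_k) * theta_k,
    so the weighted error is sum_k alpha_k * (gamma_k - (beta_k + gamma_k) * theta_k).
    """
    return sum(
        a * (g - (b + g) * t)
        for a, t, b, g in zip(alpha, theta, beta, gamma)
    )


# Example from the text: two equally sized groups with theta_1 = theta_2 = 0.5.
bias = asymptotic_bias(
    alpha=[0.5, 0.5], theta=[0.5, 0.5], beta=[0.3, 0.1], gamma=[0.0, 0.0]
)
print(-bias)        # limit of theta - theta_hat, about 0.1
print(bias ** 2)    # floor on the mean-squared error, about 0.01
```

Because the variance of $\hat{\theta}$ vanishes as the sample sizes grow, the mean-squared error approaches the squared bias, which here stays near $0.01$ no matter how many people are polled.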

7.4 Optimal Polling Strategy for Plain Averaging under Distorted Polling Responses

In this subsection, we consider designing an optimal polling strategy for parameter estimation using plain averaging, under distorted polling (test) responses. We assume that we have a total budget of $D$ for polling $K$ demographic groups. We need to decide the optimal number of polled persons for each particular group, and the optimal cost to be spent on polling each person from each particular group. Our goal is to minimize the mean-squared error of estimating $\theta$ when the estimator uses plain averaging. We remark that the mean-squared error of the plain-averaging estimate comes from two parts: the deviation of the mean of the polling data from the actual parameter $\theta$, and the variance of the random estimate $\hat{\theta}$. The optimal strategy needs to allocate the polling budget in a balanced way: a sufficiently large cost must be spent on obtaining each sample so that the deviation of the mean of the polling data from the actual parameter $\theta$ is small, while, at the same time, a sufficiently large number of persons must be polled to reduce the variance of the random estimate $\hat{\theta}$. In this subsection, we derive such an optimal polling strategy. In our derivations, we assume that the functions $\beta(c_{i}^{k})$’s and $\gamma(c_{i}^{k})$’s are explicitly known to the polling strategy designer. However, we remark that we can extend our derivations to the cases where the polling strategy designer does not explicitly know these functions. For example, we can extend our derivations to the cases where $\beta(c_{i}^{k})$’s and $\gamma(c_{i}^{k})$’s are random functions (but remain the same function for each sample from the same data source) and the designer knows only statistical information about these two functions but not the functions themselves (in fact, the results for these extensions are very similar to Theorem 7.3 below).
The results derived in this subsection are also useful in the scenario where the polling strategy designer provides only the polling data $X_{i}^{k}$’s, but not information about the functions $\beta(c_{i}^{k})$’s and $\gamma(c_{i}^{k})$’s, to the estimator: the polling strategy designer knows these functions explicitly, but the estimator does not.

Our main result in this subsection is summarized in the following theorem.

Theorem 7.3.

Let us consider $K$ independent data sources, where $K$ is a positive integer. Let $\alpha_{k}$, $\theta_{k}$, $\beta_{k}$ and $\gamma_{k}$ be defined as above. Let us assume that a fixed cost $c_{k}$ is spent on obtaining each sample from the $k$-th data source, where $1\leq k\leq K$. Suppose the total budget available for sampling is $D$. We assume that, for all possible values of the $c_{k}$’s, $\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})\neq 0$. We assume that $\sum_{k=1}^{K}\alpha_{k}(\gamma_{k}-(\beta_{k}+\gamma_{k})\theta_{k})\rightarrow 0$ only when $c_{k}\rightarrow\infty$ for every $k$. We further assume that, when $c\rightarrow\infty$,

[TABLE]

where $b_{k}$ is a positive number depending only on $k$ .

Then as $D\rightarrow\infty$ , to minimize the mean-squared error of the plain averaging,

[TABLE]

The optimal sampling cost for each sample from the $k$ -th data source is given by

[TABLE]

and the optimal number of persons polled from the $k$ -th group is given by

[TABLE]

Under the assumption of plain averaging, and the assumption of fixed cost for each sample from the same data source, to estimate the parameter $\theta=\sum_{k=1}^{K}\alpha_{k}\theta_{k}$ , the smallest possible mean-squared estimation error is given by

[TABLE]

Proof.

Suppose that the budget allocated for sampling from the $k$-th data source is $D_{k}$, and that the fixed cost to obtain each sample from the $k$-th data source is $c_{k}$. Then the mean-squared estimation error is given by

[TABLE]

where $\gamma_{k}$ and $\beta_{k}$ are functions of $c_{k}$ .

From the first term of this expression for the mean-squared error, in order to make the estimation error go to $0$ as $D\rightarrow\infty$, we must have $c_{k}\rightarrow\infty$ for all $k$ as $D\rightarrow\infty$. Thus $\beta_{k}\rightarrow 0$ and $\gamma_{k}\rightarrow 0$ as $D\rightarrow\infty$. So when $D\rightarrow\infty$, the second term of the expression for the mean-squared error lies between $(1-\epsilon)\sum_{k=1}^{K}\alpha_{k}^{2}\frac{\left[\theta_{k}-\theta_{k}^{2}\right]}{\frac{D_{k}}{c_{k}}}$ and $(1+\epsilon)\sum_{k=1}^{K}\alpha_{k}^{2}\frac{\left[\theta_{k}-\theta_{k}^{2}\right]}{\frac{D_{k}}{c_{k}}}$, where $\epsilon>0$ is an arbitrarily small positive number. Thus, as $D\rightarrow\infty$, we can minimize the mean-squared estimation error given by

[TABLE]

This formula can be changed to

[TABLE]

where

[TABLE]

and

[TABLE]

Given $D_{k}$ ’s, we can find the optimal $c_{k}$ ’s which minimize (99) as follows:

[TABLE]

Plugging $c_{k}$ into (99), we can simplify (99) as

[TABLE]

where we used $\tau_{k}=\sqrt{\frac{b_{k}G_{k}}{D_{k}}}=\sqrt{\frac{b_{k}\alpha_{k}^{2}(\theta_{k}-\theta_{k}^{2})}{D_{k}}}$ .

Now we minimize (101) over the $D_{k}$’s, $1\leq k\leq K$, subject to the constraint that $\sum_{k=1}^{K}D_{k}\leq D$. The minimizing $D_{k}$’s are given by

[TABLE]

Plugging (102) into (100), we can obtain the optimal $c_{k}$ as follows:

[TABLE]

The number of samples obtained from the $k$ -th data source, denoted by $m_{k}$ , is given by

[TABLE]

Plugging (103) and (102) into (99), we find that the smallest mean-squared error achievable through plain averaging is given by

[TABLE]

∎

As can be seen from Theorem 7.3, to achieve the smallest possible mean-squared error, the cost spent on obtaining each sample increases with the total budget $D$ on the order of $O(D^{\frac{1}{3}})$, and the number of samples from a data source increases with $D$ on the order of $O(D^{\frac{2}{3}})$. We would like to contrast this result with traditional polling that does not account for distorted data, namely when $\beta_{k}=0$ and $\gamma_{k}=0$. If there is no distorted data, we do not have to let the cost spent per sample grow to $\infty$ as $D\rightarrow\infty$.
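The scaling described above can be checked numerically with a minimal sketch. It assumes a simplified large-cost model that is our reading of the proof, not a formula quoted from the paper: the bias of the plain average behaves like $\sum_{k}b_{k}/c_{k}$ and the variance like $\sum_{k}G_{k}c_{k}/D_{k}$ with $G_{k}=\alpha_{k}^{2}(\theta_{k}-\theta_{k}^{2})$. The function names `plain_avg_mse` and `allocate`, and all numerical values, are hypothetical. Under this assumed model the stationary point has $D_{k}\propto(b_{k}G_{k})^{1/3}$ and $c_{k}=(2Db_{k}^{2}/G_{k})^{1/3}$, which may differ in form from the exact expressions of Theorem 7.3.

```python
def plain_avg_mse(costs, budgets, b, G):
    """Assumed large-cost model: (sum_k b_k/c_k)^2 + sum_k G_k * c_k / D_k."""
    bias = sum(bk / ck for bk, ck in zip(b, costs))
    variance = sum(Gk * ck / Dk for Gk, ck, Dk in zip(G, costs, budgets))
    return bias ** 2 + variance


def allocate(D, b, G):
    """Stationary point of the assumed model subject to sum_k D_k = D."""
    weights = [(bk * Gk) ** (1.0 / 3.0) for bk, Gk in zip(b, G)]
    total = sum(weights)
    # Budget per group, proportional to (b_k * G_k)^(1/3).
    budgets = [D * w / total for w in weights]
    # Cost per sample, growing like D^(1/3).
    costs = [(2.0 * D * bk * bk / Gk) ** (1.0 / 3.0) for bk, Gk in zip(b, G)]
    # Number of samples per group, growing like D^(2/3).
    samples = [Dk / ck for Dk, ck in zip(budgets, costs)]
    return budgets, costs, samples


# Hypothetical two-group example (all numbers are illustrative only).
alpha = [0.6, 0.4]
theta = [0.45, 0.55]
b = [1.0, 3.0]
G = [a * a * t * (1 - t) for a, t in zip(alpha, theta)]

for D in (1e6, 8e6):
    budgets, costs, samples = allocate(D, b, G)
    print(D, costs, samples, plain_avg_mse(costs, budgets, b, G))
```

Multiplying the budget by 8 doubles each per-sample cost and quadruples each sample count in this sketch, which matches the $O(D^{\frac{1}{3}})$ and $O(D^{\frac{2}{3}})$ orders stated above.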

8 New Lower Bounds on the Mean-Squared Error of Parameter Estimation which can be tighter than the Cramér-Rao bound, and the Chapman-Robbins bound

Parameter estimation using observed data is a classical problem in signal processing, and it is also a very important problem in big data analytics. In parameter estimation, we are often interested in obtaining lower and upper bounds on the mean-squared error of parameter estimators. It is particularly important to obtain lower bounds on the mean-squared error (or other error metrics) of parameter estimation, which give fundamental limits on the performance of every possible parameter estimator or of parameter estimators of a particular class (such as unbiased estimators). These lower bounds are useful for determining how far a given parameter estimation algorithm is from the optimal performance. It is well known that the Cramér-Rao bound (CRB) and the Chapman-Robbins bound [7] give lower bounds for unbiased parameter estimators or parameter estimators with known bias functions.

In this section, we will derive new and tighter lower bounds on the mean-squared error of parameter estimators. While we can extend our results to biased estimators or estimation of vector parameters, we will focus on our results for unbiased estimators of scalar parameters. We have derived three series of new lower bounds on the mean-squared error of parameter estimators: Lower Bounds Series A (ABS), Lower Bounds Series B (BBS), and Lower Bounds Series C (CBS). Our newly derived lower bounds on the MSE of unbiased estimators can be tighter than the well-known Cramér-Rao bound (CRB), and the Chapman-Robbins bound. Our newly derived lower bounds on the MSE of unbiased estimators are the optimal objective values of certain convex optimization problems. We can solve these convex optimization problems by looking at their Lagrange dual problems, and obtain closed-form new lower bounds on the MSE. Interestingly, we have discovered that both the Cramér-Rao bound (CRB) and the Chapman-Robbins bound are the optimal objective values of the Lagrange dual problems to special cases of these convex programs, and thus are special cases of the newly derived lower bounds in this paper.

Suppose that we would like to estimate a scalar parameter $\theta$, and assume that the observation is denoted by a random variable $X$. We further assume that the probability density function of $X$ is given by $p(x;\theta)$, where $\theta$ is a parameter. Suppose that there is an unbiased estimator $g(X)$ such that, for every $\theta$,

[TABLE]

We would like to bound the mean-squared estimation error

[TABLE]

even though our optimization framework to derive lower bounds on estimation errors can be further extended to include other metrics, such as $E_{p(x;\theta)}\left(|g(X)-\theta|^{3}\right)$ .

8.1 New Lower Bounds Series A (ABS)

We first notice that

[TABLE]

Moreover, we notice that, for every $\theta_{1}$ , $\theta_{2}$ , and $\theta$ , we always have

[TABLE]

Thus for every $\theta_{1}$ and $\theta_{2}$ , we have

[TABLE]

Thus, we know the MSE of any unbiased estimator is lower bounded by the optimal objective value of the following optimization problem:

[TABLE]

We first notice that this is a convex program in $g(x)$, which can have an infinite number of constraints. If $x$ ranges over an infinite set, then this program is an infinite-dimensional convex optimization problem. By considering only $N$ pairs of $\theta_{1,i}$ and $\theta_{2,i}$, where $1\leq i\leq N$, we see that the MSE of any unbiased estimator is lower bounded by the optimal objective value of the following optimization problem with a finite number of constraints:

[TABLE]

We let $h(x)=g(x)-\theta$ , $p(x)=p(x;\theta)$ , $q_{i}(x)=p(x;\theta_{1,i})-p(x;\theta_{2,i})$ and $t_{i}=\theta_{1,i}-\theta_{2,i}$ . Then the optimization problem can be reformulated as:

[TABLE]

The Lagrangian associated with the optimization problem (110) is given by

[TABLE]

For a fixed $x$ ,

[TABLE]

has a minimum value

[TABLE]

when

[TABLE]

Thus the Lagrange dual function associated with the optimization problem (110) is given by

[TABLE]

The Lagrange dual problem to the optimization problem (110) is thus given by

[TABLE]

Expanding $\left(p(x)u+\sum_{i=1}^{N}\lambda_{i}q_{i}(x)\right)^{2}$ to $p^{2}(x)u^{2}+(\sum_{i=1}^{N}\lambda_{i}q_{i}(x))^{2}+2up(x)(\sum_{i=1}^{N}\lambda_{i}q_{i}(x))$ , we can change the dual problem to

[TABLE]

Because $\int q_{i}(x)\,dx=\int(p(x;\theta_{1,i})-p(x;\theta_{2,i}))\,dx=1-1=0,$ (116) is equal to

[TABLE]

with the maximizing $u$ given by $u=0$ .

We consider an $N\times N$ matrix $B$ with its element in the $i$ -th row and $j$ -th column as

[TABLE]

where $1\leq i\leq N$ , and $1\leq j\leq N$ .

We denote $\lambda=(\lambda_{1},\lambda_{2},...,\lambda_{N})^{T}$ , and $t=(t_{1},t_{2},...,t_{N})^{T}$ . Then (118) becomes

[TABLE]

Taking derivatives with respect to the vector $\lambda$ , we have

[TABLE]

and thus the maximizing $\lambda$ is given by

[TABLE]

where we assume $B$ is full-rank. If $B$ is not invertible, we can replace $B^{-1}$ with $B^{\dagger}$, where the superscript ${\dagger}$ denotes the Moore-Penrose inverse. If $B$ is not invertible, we assume that $t$ is within the column space of $B$ (otherwise the optimal value of (118) will be $\infty$, implying that an unbiased estimator is impossible).

The corresponding maximum objective value of the dual problem is

[TABLE]

Since the optimization problem (109) has no more constraints than the optimization problem (106), the optimal objective value of (106) is no smaller than the optimal objective value of (109). By weak duality, the optimal objective value of (109) is greater than or equal to $t^{T}B^{-1}t$ (if an unbiased estimator is feasible at all, we also have strong duality by Slater’s condition, meaning that $t^{T}B^{-1}t$ is equal to the optimal objective value of (109)). Thus the optimal objective value of (106) is at least $t^{T}B^{-1}t$.

Let us summarize the new lower bound on the MSE of an unbiased estimator. Let $\theta_{1,i}$’s and $\theta_{2,i}$’s be any $N$ pairs of values from the domain $\mathcal{D}$ of the parameter, where $1\leq i\leq N$ and $N$ is any positive integer. We define

[TABLE]

where the element of $B$ in its $i$-th row and $j$-th column is given by

[TABLE]

Then any unbiased estimator $g(X)$ satisfies

[TABLE]
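As an illustration of how the bound $t^{T}B^{-1}t$ can be evaluated, the following sketch computes it for a distribution with finite support. It assumes, following the Lagrangian derivation above and the form quoted in Special Case b) below, that $B_{i,j}=\sum_{x}q_{i}(x)q_{j}(x)/p(x)$ with $q_{i}(x)=p(x;\theta_{1,i})-p(x;\theta_{2,i})$ and $t_{i}=\theta_{1,i}-\theta_{2,i}$. The Binomial$(2,\theta)$ observation model and the helper names `solve`, `series_a_bound` and `binom2` are hypothetical choices made for this example, not taken from the paper; $B$ is assumed invertible.

```python
from fractions import Fraction


def solve(B, t):
    """Solve B y = t by Gauss-Jordan elimination (B assumed invertible)."""
    n = len(t)
    M = [list(row) + [rhs] for row, rhs in zip(B, t)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]


def series_a_bound(pmf, theta, pairs):
    """Evaluate t^T B^{-1} t for the given (theta_1, theta_2) pairs."""
    p = pmf(theta)
    support = list(p)
    # q_i(x) = p(x; theta_1) - p(x; theta_2) for each pair.
    q = [{x: pmf(a)[x] - pmf(c)[x] for x in support} for a, c in pairs]
    t = [a - c for a, c in pairs]
    B = [[sum(qi[x] * qj[x] / p[x] for x in support) for qj in q] for qi in q]
    y = solve(B, t)
    return sum(ti * yi for ti, yi in zip(t, y))


def binom2(theta):
    """Illustrative model: X ~ Binomial(2, theta), not from the paper."""
    return {0: (1 - theta) ** 2, 1: 2 * theta * (1 - theta), 2: theta ** 2}


theta = Fraction(1, 4)
one_pair = [(theta, Fraction(1, 2))]
two_pairs = [(theta, Fraction(1, 2)), (theta, Fraction(3, 4))]
print(series_a_bound(binom2, theta, one_pair))
print(series_a_bound(binom2, theta, two_pairs))
```

In this toy model, one pair gives $9/112$, while two pairs give $3/32=\theta(1-\theta)/2$, which is exactly the variance of the unbiased estimator $X/2$. Adding constraints can therefore tighten the bound until it is attained.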

8.2 Special Cases of the New Lower Bounds Series A on the MSE of an unbiased estimator: Connections with the Cramér-Rao bound, and the Chapman-Robbins bound

In this subsection, we consider some special cases of the newly derived Lower Bounds Series A on the MSE of an unbiased estimator, and show how the classical Cramér-Rao bound and Chapman-Robbins bound are connected with the newly derived bounds in this paper.

We first consider the case where we only have two constraints in the optimization problem (106). We pick two numbers $\theta_{1}$ and $\theta_{2}$ from the domain of the parameter. Then the optimal objective value of the following optimization problem gives a lower bound on the MSE of an unbiased estimator.

[TABLE]

By our derivations above, the optimal objective value to (122) is given by

[TABLE]

We consider several special cases of $\theta_{1}$ and $\theta_{2}$ as follows.

Special Case a):

In this case, we take $\theta_{1}=\theta$ and $\theta_{2}=\theta+\delta$. Then as $\delta\rightarrow 0$,

[TABLE]

where $p^{\prime}(x;\theta)=\frac{dp(x;\theta)}{d\theta}$ . This gives the classical Cramér-Rao bound.
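The limit can be sketched as follows, assuming the two-point bound has the form $\delta^{2}/\int\frac{(p(x;\theta)-p(x;\theta+\delta))^{2}}{p(x;\theta)}\,dx$ quoted in Special Case b), and assuming $p(x;\theta)$ is differentiable in $\theta$ with the limit and the integral interchangeable. The symbol $I(\theta)$ for the Fisher information is notation introduced here for illustration.

```latex
\begin{aligned}
\frac{\delta^{2}}{\int \frac{\left(p(x;\theta)-p(x;\theta+\delta)\right)^{2}}{p(x;\theta)}\,dx}
&= \frac{1}{\int \frac{1}{p(x;\theta)}
   \left(\frac{p(x;\theta+\delta)-p(x;\theta)}{\delta}\right)^{2} dx} \\
&\xrightarrow[\;\delta\rightarrow 0\;]{}
   \frac{1}{\int \frac{\left(p^{\prime}(x;\theta)\right)^{2}}{p(x;\theta)}\,dx}
 = \frac{1}{E_{p(x;\theta)}\left[\left(\frac{\partial \ln p(X;\theta)}{\partial\theta}\right)^{2}\right]}
 = \frac{1}{I(\theta)} .
\end{aligned}
```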

Special Case b):

In this case, we take $\theta_{1}=\theta$ and $\theta_{2}=\theta+\delta$. Then

[TABLE]

One can take the supremum of $\frac{\delta^{2}}{\int\frac{(p(x;\theta)-p(x;\theta+\delta))^{2}}{p(x;\theta)}\,dx}$ over the set of $\delta$ , and this gives the Chapman-Robbins bound.

Special Case c):

In this case, we take a general $\theta_{1}$ , and $\theta_{2}=\theta_{1}+\delta$ . Then as $\delta\rightarrow 0$ ,

[TABLE]

where $p^{\prime}(x;\theta_{1})=\frac{dp(x;\theta_{1})}{d\theta_{1}}$. This gives a new lower bound on the MSE of an unbiased estimator, which is different from both the Cramér-Rao bound and the Chapman-Robbins bound. This new bound can be tighter than both the Cramér-Rao bound and the Chapman-Robbins bound.

Special Case d):

Because the lower bound (123) holds for arbitrary $\theta_{1}$ and $\theta_{2}$ , we have

[TABLE]

This bound can be tighter than the Cramér-Rao bound, the Chapman-Robbins bound, and the bound in Special Case c).

8.2.1 A Simple Example for which the New Bounds are Tighter than the Cramér-Rao bound, and the Chapman-Robbins bound

Let us consider estimating a parameter $\theta\in[0,1.5]$ . We assume that the observed random variable $X$ follows the Bernoulli distribution. Moreover, $P(X=1;\theta)=1-|1-\theta|$ and $P(X=0;\theta)=|1-\theta|$ .

Let us consider an unbiased estimation of $\theta$ from $X$ , and let $\theta_{1}=0.25$ . Then for the Cramér-Rao bound, the lower bound on the MSE of estimating $\theta_{1}$ from $X$ is given by $\theta_{1}(1-\theta_{1})=\frac{3}{16}$ . The Chapman-Robbins bound is achieved when $\theta_{2}=1.5=\theta_{1}+\delta$ , so $\delta=1.5-\theta_{1}=1.25$ . Thus the Chapman-Robbins bound on the MSE of an unbiased parameter estimator is given by

[TABLE]

In contrast, in the newly derived Lower Bounds Series A, we can look at the Special Case d) and consider $\theta_{1}^{\prime}=0.6$ and $\theta_{2}^{\prime}=1.4$ . One can see that

[TABLE]

In fact, because $\theta_{1}^{\prime}=0.6$ and $\theta_{2}^{\prime}=1.4$ produce exactly the same probability distribution for $X$, no unbiased parameter estimator exists for this estimation task. Thus the MSE of an unbiased estimator should indeed be $\infty$. The bound given by Lower Bounds Series A is therefore the tightest possible, and is tighter than both the Cramér-Rao bound and the Chapman-Robbins bound. One can give many other examples in which the Lower Bounds Series A are tighter than the Cramér-Rao bound and the Chapman-Robbins bound.
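The three bounds in this example can be reproduced with a short sketch. It assumes the two-point bound (123) has the form $(\theta_{1}-\theta_{2})^{2}/\sum_{x}\frac{(p(x;\theta_{1})-p(x;\theta_{2}))^{2}}{p(x;\theta)}$, which is our reading of the derivation and of the expression quoted in Special Case b). The helper names `pmf` and `two_point_bound` are hypothetical. Exact rational arithmetic is used so that the two identical distributions give a denominator of exactly zero.

```python
from fractions import Fraction
import math


def pmf(theta):
    """Bernoulli pmf of the example: P(X=1; theta) = 1 - |1 - theta|."""
    p1 = 1 - abs(1 - theta)
    return {0: 1 - p1, 1: p1}


def two_point_bound(theta, theta1, theta2):
    """(theta1-theta2)^2 / sum_x (p(x;theta1)-p(x;theta2))^2 / p(x;theta)."""
    p, p1, p2 = pmf(theta), pmf(theta1), pmf(theta2)
    denom = sum((p1[x] - p2[x]) ** 2 / p[x] for x in p)
    # A zero denominator means the two parameters are indistinguishable.
    return math.inf if denom == 0 else (theta1 - theta2) ** 2 / denom


theta = Fraction(1, 4)

# Cramér-Rao bound for a Bernoulli observation with P(X=1) = theta.
crb = theta * (1 - theta)

# Chapman-Robbins choice from the text: theta_2 = 1.5, i.e. delta = 1.25.
chapman_robbins = two_point_bound(theta, theta, Fraction(3, 2))

# Series A, Special Case d): theta_1' = 0.6 and theta_2' = 1.4.
series_a = two_point_bound(theta, Fraction(3, 5), Fraction(7, 5))

print(crb, chapman_robbins, series_a)
```

Under this reading, the Cramér-Rao value is $3/16$ and the Chapman-Robbins value at $\delta=1.25$ is $75/16$, both finite, whereas the pair $(0.6,1.4)$ yields $\infty$.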

8.3 New Lower Bounds Series B (BBS)

In this subsection, we will derive new lower bounds on the MSE of an unbiased estimator, and we term these new bounds derived in this subsection as Lower Bounds Series B (BBS). We will use different convex optimization problems, and their Lagrange dual problems to derive these new bounds.

Because for every $\theta_{1}$ and $\theta_{2}$ , we have

[TABLE]

then for any function $f(\theta_{1},\theta_{2})$ ,

[TABLE]

Let us take $N$ functions $f_{i}(\theta_{1},\theta_{2})$ , $1\leq i\leq N$ , denote

[TABLE]

and denote

[TABLE]

Thus, we know the MSE of any unbiased estimator is lower bounded by the optimal objective value of the following optimization problem:

[TABLE]

We notice that (127) has the same format as the optimization problem (110). Moreover, we still have

[TABLE]

Thus we conclude that (127) has the same solution as the optimization problem (110), except that the values of $q_{i}(x)$ ’s and $t_{i}$ ’s are different. Again we consider an $N\times N$ matrix $B$ with its element in the $i$ -th row and $j$ -th column as

[TABLE]

where $1\leq i\leq N$ , and $1\leq j\leq N$ . We denote $\lambda=(\lambda_{1},\lambda_{2},...,\lambda_{N})^{T}$ , and $t=(t_{1},t_{2},...,t_{N})^{T}$ .

Then the variance of any unbiased estimator is always lower bounded by $t^{T}B^{-1}t$. We stress that these bounds can be tighter than, or at least as tight as, the Lower Bounds Series A, depending on the choices of the functions $f_{i}(\theta_{1},\theta_{2})$’s.

8.4 New Lower Bounds Series C (CBS)

In this subsection, we will derive new lower bounds on the MSE of an unbiased estimator, and we term these new bounds derived in this subsection as Lower Bounds Series C (CBS). We will use more general convex optimization problems, for which the integrals of $q_{i}(x)$ ’s over $x$ may not necessarily be 0, and their Lagrange dual problems to derive these new bounds.

We notice that, for every $\theta_{1}$ , we have

[TABLE]

Then for any function $f(\theta_{1})$ , we have

[TABLE]

Let us take $N$ functions $f_{i}(\theta_{1})$ , $1\leq i\leq N$ , denote

[TABLE]

and denote

[TABLE]

Thus, we know the MSE of any unbiased estimator is lower bounded by the optimal objective value of the following optimization problem:

[TABLE]

Note that this problem can be solved in the same way as we have solved (110). We recall that (116) is equal to

[TABLE]

Now we note $\int q_{i}(x)\,dx$ may not necessarily be 0. So the maximizing $u$ is given by the number $u$ which maximizes

[TABLE]

Setting the derivative to $0$, we find that the maximizing $u$ is given by

[TABLE]

Thus the maximum value of $-\frac{u^{2}}{4}-\frac{u}{2}\left(\sum_{i=1}^{N}\int q_{i}(x)\,dx\right)$ is given by

[TABLE]
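For reference, this maximization is a one-variable concave quadratic and can be worked out directly. In the sketch below, $s$ is shorthand introduced here (not notation from the paper) for the coefficient multiplying $-\frac{u}{2}$ in the expression being maximized.

```latex
\frac{d}{du}\left(-\frac{u^{2}}{4}-\frac{u}{2}\,s\right)
  = -\frac{u}{2}-\frac{s}{2} = 0
  \;\Longrightarrow\; u^{*} = -s,
\qquad
-\frac{(u^{*})^{2}}{4}-\frac{u^{*}}{2}\,s
  = -\frac{s^{2}}{4}+\frac{s^{2}}{2}
  = \frac{s^{2}}{4}.
```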

In summary, recall that $f_{i}(\theta_{1})$ , $1\leq i\leq N$ are $N$ functions,

[TABLE]

and

[TABLE]

We define

[TABLE]

where $B$ ’s element in its $i$ -th row and $j$ -th column is given by

[TABLE]

Then any unbiased estimator $g(X)$ satisfies

[TABLE]

This newly derived Lower Bounds Series C can be tighter than the Lower Bounds Series A and Lower Bounds Series B, which can in turn be tighter than the Cramér-Rao bound, and the Chapman-Robbins bound.

9 Acknowledgment

Weiyu Xu would like to thank H. C. for conversations about presidential elections that partially inspired Xu’s interest in this topic, and for helping with cleaning and typesetting the solution to one of the Lagrange dual problems in LaTeX. W. Xu is also thankful to the California Institute of Technology Alumni Association for regularly sending him hard copies of its magazine The Caltech Alumni Association Annual, which introduced to W. Xu the bug eating story [10] of Caltech alumnus Professor Sam Wang at Princeton University, and helped inspire this research.

10 Appendix

10.1 Derivations of the Solution to (41)

We consider the following optimization problem, where $a_{1}$ , …, $a_{K}$ are positive constant numbers:

[TABLE]

subject to

[TABLE]

and

[TABLE]

The Lagrangian of the optimization problem is given by

[TABLE]

where $\lambda$ , $\lambda_{1}$ , …, and $\lambda_{K}$ are nonnegative numbers.

We look at the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem above.

[TABLE]

Since $a_{1},a_{2},...,a_{K}>0$ , we have:

[TABLE]

Therefore, we must have

[TABLE]

One can check that, under these values for $b_{k}$ ’s, $\lambda_{k}$ ’s and $\lambda$ , the KKT conditions are all satisfied. Moreover, the optimal objective value is

[TABLE]

For the optimization problem (41), we can set

[TABLE]

and

[TABLE]

then we get

[TABLE]

and, moreover, under the optimal $C_{k}^{*}$ , the optimal value of (41) is given by

[TABLE]

10.2 Derivations of the Solution to (48)

The Lagrangian of the optimization problem (48) is given by

[TABLE]

where $\lambda$ , $\lambda_{1}$ , $\tau_{1}$ , $\lambda_{2}$ , $\tau_{2}$ , …, $\lambda_{K}$ and $\tau_{K}$ are nonnegative numbers.

Now we look at the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem (48).

[TABLE]

We let $\tau_{k}=0$ and $\lambda_{k}=0$ for $1\leq k\leq K$ . Then we have

[TABLE]

By definition, we have

[TABLE]

implying that

[TABLE]

Thus if $\lambda\neq 0$ and $C_{k}\geq c_{k}^{**}$ , we have

[TABLE]

When $\lambda\neq 0$ and the KKT conditions are satisfied, we have

[TABLE]

so

[TABLE]

Solving for $\lambda$ and $C_{k}$ , we have

[TABLE]

and

[TABLE]

We can see that when $C\geq\sum_{k=1}^{K}c_{k}^{**}$, the $C_{k}$’s, $\lambda$, $\lambda_{k}$’s and $\tau_{k}$’s obtained above satisfy all the KKT conditions listed in (10.2). Thus these values lead to the optimal value of (48). Plugging the optimal $C_{k}$ into the objective function of (48), we find that the optimal value of (48) is given by

[TABLE]

Bibliography


[1] Final election update: There’s a wide range of outcomes, and most of them come up Clinton. FiveThirtyEight.com. November 8, 2016. Retrieved January 6, 2018.
[2] Final Projections: Clinton 323 EV, 51 Democratic Senate Seats, GOP house. election.Princeton.edu. Retrieved January 6, 2018.
[3] Five reasons Nate Silver is wrong & Sam Wang is right: Hillary is 99% + likely to win. DailyKos.com. Retrieved January 6, 2018.
[4] Grading the 2016 election forecasts. Buzzfeed.com. Retrieved January 6, 2018.
[5] https://en.wikipedia.org/wiki/2016_united_states_presidential_election. Retrieved December 13th, 2018.
[6] www.realclearpolitics.com. Retrieved December 13th, 2018.
[7] J. M. Hammersley. On estimating restricted parameters. Journal of the Royal Statistical Society. Series B (Methodological), 12(2):192–240, 1950.
[8] Steven M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
