A Stochastic LBFGS Algorithm for Radio Interferometric Calibration

Sarod Yatawatta; Lukas De Clercq; Hanno Spreeuw; Faruk Diblen

arXiv:1904.05619·astro-ph.IM·April 16, 2019

A Stochastic LBFGS Algorithm for Radio Interferometric Calibration

Sarod Yatawatta, Lukas De Clercq, Hanno Spreeuw, Faruk Diblen

PDF

TL;DR

This paper introduces a stochastic LBFGS algorithm tailored for large-scale radio interferometric data calibration, enabling fine-resolution processing that preserves transient signals like FRBs and improving neural network training.

Contribution

The paper presents a novel stochastic LBFGS algorithm capable of calibrating large datasets at high resolution without data averaging, enhancing detection of narrowband signals.

Findings

01

Effective calibration at fine time and frequency resolutions.

02

Preservation of signals like fast radio bursts (FRBs).

03

Improved neural network training performance.

Abstract

We present a stochastic, limited-memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm that is suitable for handling very large amounts of data. A direct application of this algorithm is radio interferometric calibration of raw data at fine time and frequency resolution. Almost all existing radio interferometric calibration algorithms assume that it is possible to fit the dataset being calibrated into memory. Therefore, the raw data is averaged in time and frequency to reduce its size by many orders of magnitude before calibration is performed. However, this averaging is detrimental for the detection of some signals of interest that have narrow bandwidth and time duration such as fast radio bursts (FRBs). Using the proposed algorithm, it is possible to calibrate data at such a fine resolution that they cannot be entirely loaded into memory, thus preserving such signals. As an…

Equations9

V_{pq} = i = 1 \sum K J_{p i} C_{pq i} J_{q i}^{H} + N_{pq}

V_{pq} = i = 1 \sum K J_{p i} C_{pq i} J_{q i}^{H} + N_{pq}

{\bf v}_{pq}={\bf s}_{pq}({\mbox{\boldmath$\theta$}})+{\bf n}_{pq}

{\bf v}_{pq}={\bf s}_{pq}({\mbox{\boldmath$\theta$}})+{\bf n}_{pq}

x = [real (v_{12}^{T}), imag (v_{12}^{T}), real (v_{13}^{T}), \dots]^{T}

x = [real (v_{12}^{T}), imag (v_{12}^{T}), real (v_{13}^{T}), \dots]^{T}

\displaystyle{\bf m}({\mbox{\boldmath$\theta$}})=[\mathrm{real}({\bf s}_{12}^{T}),\mathrm{imag}({\bf s}_{12}^{T}),\mathrm{real}({\bf s}_{13}^{T}),\ldots]^{T}

f({\mbox{\boldmath$\theta$}})=\sum_{i=1}^{4TFN(N-1)}\log\left(1+\frac{\left({\bf x}[i]-{\bf m}({\mbox{\boldmath$\theta$}})[i]\right)^{2}}{\nu}\right)

f({\mbox{\boldmath$\theta$}})=\sum_{i=1}^{4TFN(N-1)}\log\left(1+\frac{\left({\bf x}[i]-{\bf m}({\mbox{\boldmath$\theta$}})[i]\right)^{2}}{\nu}\right)

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

A Stochastic LBFGS Algorithm for Radio Interferometric Calibration

Abstract

We present a stochastic, limited-memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm that is suitable for handling very large amounts of data. A direct application of this algorithm is radio interferometric calibration of raw data at fine time and frequency resolution. Almost all existing radio interferometric calibration algorithms assume that it is possible to fit the dataset being calibrated into memory. Therefore, the raw data is averaged in time and frequency to reduce its size by many orders of magnitude before calibration is performed. However, this averaging is detrimental for the detection of some signals of interest that have narrow bandwidth and time duration such as fast radio bursts (FRBs). Using the proposed algorithm, it is possible to calibrate data at such a fine resolution that they cannot be entirely loaded into memory, thus preserving such signals. As an additional demonstration, we use the proposed algorithm for training deep neural networks and compare the performance against the mainstream first order optimization algorithms that are used in deep learning.

Index Terms— Calibration, Radio interferometry, LBFGS, FRB, Machine learning

1 Introduction

Modern radio interferometers produce raw data at a very fine time and frequency resolution covering a large observing bandwidth and a long observing time. This means that the total number of data points at the original resolution can run into billions, thus entering radio astronomy into the big-data era. In order to apply most calibration algorithms (e.g., [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) the number of data points needs to be reduced in size to fit into memory. This is done by averaging the data in frequency and in time, thus reducing the resolution of the data. While this makes the data manageable, it also has some drawbacks. First, signals of scientific interest with a narrow bandwidth and time duration such as FRBs [13] might be lost. Second, the removal of low-power, broad-band interfering signals (both terrestrial and celestial origin) [14, 15, 16] might be difficult.

In this paper, we address the calibration of radio interferometric data at the original resolution that is determined by the correlator. Calibration is essential and preferable at this resolution for the removal of strong interfering signals such as the Sun and a few other strong radio sources (e.g., Cassiopeia A, Cygnus A). Naturally, the amount of data being calibrated will not fit into memory and in order to solve the nonlinear optimization problem associated with calibration, we introduce a stochastic, limited-memory Broyden Fletcher Goldfarb Shanno (LBFGS) algorithm. Due to faster convergence than gradient descent methods and lower memory usage than methods that require the full Hessian [17, 18], LBFGS has been extensively used in our calibration software [19].

In LBFGS, the direction of descent is estimated by using the gradient of the cost function and an approximation to the (inverse) Hessian of the cost function, represented by a few norm $1$ updates [17]. Once the descent direction is found, the amount of descent is determined by selecting the step size based on a line search algorithm. The approximation to the Hessian is updated with each iteration, by using the difference in the parameters and the difference in the gradients (curvature pairs) [17, 20]. There are several problems that need to be overcome in adopting LBFGS to multi-batch mode operation. First, there is increased noise in gradient estimation using a mini-batch compared to the use of the full dataset. Secondly, and more critically, calculating the difference of gradients cannot in principle use two different mini-batches for gradient estimation. We reiterate that our main constraint is the size of the dataset and we aim to devise an algorithm that can load fixed size mini-batches into memory. The novelty of the proposed method and the relation to prior work are as follows.

•

Estimating the variance of the gradient is a crucial element both in [21] and our method. This ideally requires the gradient information for each data point within a single mini-batch. However, in our method, we use an online estimate of the variance of the gradient (based on [22]) and this does not require the gradient of each data point within a mini-batch. This is the major novelty in our work.

•

In order to find the difference in gradients, data overlap is used in [23, 24]. Our method does not use data overlap (nor is it practically feasible). However, we do use more than one iteration per mini-batch, which can be considered as full data overlap.

•

Variable mini-batch sizes are used in [21], however we keep the mini-batch size fixed.

•

A fixed step size is used in [23, 24] and [25] uses normalized descent directions, both slowing the convergence. As in [21], our method uses a variable step size based on Armijo line search [20]. Since we use an online estimate of the variance of the gradient, the initial value of step size for the line search is determined by this value, similar to [21].

•

We use stable curvature pair updates as in [24, 21]. In fact, whenever a new mini-batch is used, we skip the update of the curvature because the difference in gradients is based on different data. This is a drawback of our algorithm that requires more than one iteration per each mini-batch.

•

Our previous work used ordered subsets acceleration in [26] where we used mini-batches for the gradient (Jacobian) estimation but the cost function was evaluated using the full batch. However, this method does not work when the full batch cannot fit in memory (which is the problem we are concerned with in this work).

The rest of the paper is organized as follows. We give a brief introduction to radio interferometric calibration using the LBFGS algorithm in section 2. We describe the proposed stochastic LBFGS algorithm in section 3. Next, we provide test results based on radio interferometric calibration and deep learning in 4 illustrating the performance of the proposed algorithm. Finally, we draw our conclusions in section 5. Notation: Matrices and vectors are denoted by bold upper and lower case letters as ${\bf J}$ and ${\bf v}$ , respectively. The transpose and the Hermitian transpose are given by $(\cdot)^{T}$ and $(\cdot)^{H}$ , respectively. The matrix Frobenius norm is given by $\|\cdot\|$ and the $l_{1}$ norm by $\|\cdot\|_{1}$ . The set of real and complex numbers are denoted by ${\mathbb{R}}$ and ${\mathbb{C}}$ , respectively. The identity matrix (size $N\times N$ ) is given by ${\bf I}_{N}$ . The Hadamard (element wise) product is denoted by $\circ$ .

2 Radio interferometric calibration

In this section we give a brief overview of radio interferometric calibration in relation to the use of the LBFGS algorithm. A single data point produced by a pair of stations $p$ and $q$ forming an interferometer can be modeled as [27]

[TABLE]

where ${\bf V}_{pq}$ , ${\bf J}_{pi}$ , ${\bf J}_{qi}$ , ${\bf C}_{pqi}$ and ${\bf N}_{pq}$ are in ${\mathbb{C}}^{2\times 2}$ . All matrices in (1) are functions of time and frequency and this is implicitly assumed throughout the rest of the paper. This observed signal ${\bf V}_{pq}$ is modeled as a superposition of electromagnetic radiation emanating from $K$ distinct celestial sources in the sky. In (1), ${\bf C}_{pqi}$ represents the source coherency for the $i$ -th source, seen from the interferometer (or baseline) $p$ - $q$ . The values of ${\bf C}_{pqi}$ can be exactly calculated for any given $p,q,i$ and at any time and frequency [28]. The signals from celestial sources are corrupted by the atmosphere as well as the instrument and these corruptions are represented by the matrices given by ${\bf J}_{pi}$ , ${\bf J}_{qi}$ in (1). We also have the noise matrix ${\bf N}_{pq}$ . Calibration is the estimation of ${\bf J}_{pi}$ for all $p$ and $i$ , or estimating a set of parameters $\theta$ describing ${\bf J}_{pi}$ .

Vectorizing (1) we get

[TABLE]

where ${\bf s}_{pq}({\mbox{\boldmath$ \theta $}})=\sum_{i=1}^{K}({\bf J}_{qi}^{\star}\otimes{\bf J}_{pi})vec({\bf C}_{pqi})$ and ${\bf v}_{pq}=vec({\bf V}_{pq})$ , ${\bf n}_{pq}=vec({\bf N}_{pq})$ . We represent ${\bf J}_{pi}$ and ${\bf J}_{qi}$ in (2), as $\theta$ , a vector of real parameters of length $8NK$ ( $\in{\mathbb{R}}^{8NK\times 1}$ ), but this could have more parameters if we add time or frequency dependence to ${\bf J}_{pi}$ and ${\bf J}_{qi}$ . We stack data points as vectors

[TABLE]

where (assuming data at $T$ time and $F$ frequency samples are calibrated together) ${\bf x}$ and ${\bf m}({\mbox{\boldmath$ \theta $}})$ are vectors of size $N(N-1)/2\times 8\times T\times F$ ( $\in{\mathbb{R}}^{4TFN(N-1)}$ ).

The cost function minimized by LBFGS in robust calibration is given by

[TABLE]

where ${\bf x}[i]$ and ${\bf m}({\mbox{\boldmath$ \theta $}})[i]$ represent the $i$ -th elements of ${\bf x}$ and ${\bf m}({\mbox{\boldmath$ \theta $}})$ , respectively, and $\nu$ $(=2)$ is a constant [5]. More detail about the use of full batch mode LBFGS in calibration can be found in [19]. In the next section, we consider modifications to the original LBFGS algorithm when the full dataset cannot be loaded into memory.

3 Stochastic LBFGS algorithm

We describe the modifications made to the batch mode LBFGS algorithm for operation in mini-batch (stochastic) mode in this section. The key driver for these changes is the need to calibrate a dataset that cannot entirely fit into memory, i.e., the length of ${\bf x}$ $4TFN(N-1)$ is too large to fit into memory. The number of parameters or length of $\theta$ $8NK$ is large but not too large that it cannot fit into memory. We use $M$ as the memory size of the LBFGS algorithm, and for the stochastic mode of operation, we need $2$ additional vectors of the size of $\theta$ . Thus, in total, the extra memory requirement is $2(M+1)$ vectors of the size of $\theta$ .

The pseudocode for the stochastic LBFGS algorithm is given in algorithm 1. We describe in detail key operations that are different from the full batch LBFGS algorithm as follows. Note that we keep the mini-batch size fixed, and we do not have access to gradient statistics within a mini-batch in our algorithm. The price to pay for this approach is that more that one iteration is needed per each mini-batch (ideally about $4$ ). In all our practical tests, we keep the memory size for Hessian approximation as $M=7$ .

•

Line 4: The mini-batches can be processed in any arbitrary order, even though it is presented as sequential processing.

•

Lines 9-10: Online estimation of variance of gradient is based on [22]. The update of ${\bf g}_{k}$ and ${\bf m}_{k}$ are only done if a new batch of data is received (line 8). Based on the estimated variance, the upper bound for step size $\overline{\alpha}$ is found (line 11).

•

Line 13: BFGS update using $M$ curvature pairs is done as in [20].

•

Lines 15-17: Step size selection is done using Armijo line search as in [21, 20]. In contrast, in the full batch LBFGS algorithm [19, 29], we use exact line search with cubic interpolation [17].

•

Lines 19, 22: We update curvature pairs $({\bf y},{\bf s})$ only if the gradient difference on line 21 is based on the same batch and if the inner product over norm ratio is larger than a threshold $\epsilon$ . This is similar to the safe update rule used in [24, 21].

In the next section, we test the performance of algorithm 1 against current contenders.

4 Test results

We demonstrate the performance of the proposed stochastic LBFGS algorithm in its intended use, i.e., radio interferometric calibration as well as a completely different application, i.e., deep learning in this section.

4.1 Radio interferometric calibration

We simulate an array with $N=1024$ stations calibrating data along $K=4$ directions in the sky. We use $T=10$ time samples and a single frequency ( $F=1$ ). Therefore the amount of data points being calibrated is $4N(N-1)T\approx 41.9$ million. The number of parameters estimated is $8NK=32768$ , which is much smaller than the number of data points. In real life, the number of data points being calibrated could run upto a few billion and the simulation presented here is only a scaled down version. The systematic errors ${\bf J}_{pi},{\bf J}_{qi}$ in (1) are drawn from a complex uniform distribution in $[0,1]$ as $\mathcal{U}(0,1)+\jmath\mathcal{U}(0,1)$ . Noise is finally added to the data with a signal to noise ratio $E\{\frac{\|{\bf V}_{pq}\|^{2}}{\|{\bf N}_{pq}\|^{2}}\}=40$ and noise is drawn from a complex circular Gaussian distribution. The $K=4$ point sources have intensities ranging from $1$ to $5$ intensity units and are randomly positioned in the sky. The solutions are initialized to ${\bf J}_{pi}={\bf I}_{2}$ for all $p$ and $i$ .

We compare the performance of LBFGS in full batch mode operation against the stochastic version in calibrating the data in Fig. 1. In full batch mode, we load the whole dataset into memory and use exact line search with cubic interpolation [17, 29] for step size selection. In mini-batch mode operation we use $1/5$ -th of the full dataset at each iteration ( $B=5$ ), and for each mini-batch of data, we use $J=3$ ( $E=17$ ) and $J=4$ ( $E=13$ ) LBFGS iterations in algorithm 1. We have shown the cost (4) evaluated over the full dataset in Fig. 1 for comparison. Note that in mini-batch LBFGS, the cost evaluated at each iteration of algorithm 1 is actually based on a smaller dataset and what is shown in Fig. 1 is the cost based on the full dataset, only for comparison.

From Fig. 1 we see that we reach the same final cost by using the stochastic version of LBFGS, but the convergence is slower, i.e., many more iterations are needed. Moreover, increasing $J$ (iterations per mini-batch) makes the convergence faster. In terms of computational cost, all three modes of operation in Fig. 1 are almost equal. For the stochastic version, the cost of computing the cost function and the gradient is $1/5$ -th of the cost of computing them with the full dataset. However, more iterations are used in the stochastic mode, thus making the total computational cost almost the same. Nonetheless, the key advantage is that the stochastic mode is using much less data at each iteration. Note also that we have already compared the use of first order methods in [29] in full batch mode of operation against the full batch mode LBFGS and we see significantly better performance in LBFGS. Therefore we have not presented any such comparisons for radio interferometric calibration here.

4.2 Deep learning

The use of the stochastic LBFGS algorithm has so far given mixed results in deep learning [23, 24, 21, 25] compared to the popular first order methods used in training, i.e., stochastic gradient descent (SGD) [30] and Adam [31]. We have implemented algorithm 1 in PyTorch [32], a popular machine learning package. We compare the performance of the proposed algorithm in training a simple convolutional neural network with $2$ convolutional layers and $3$ fully connected layers as given in [33]. We use the CIFAR 10 dataset [34] and use $50000$ images for training and $10000$ images for verification.

Compared with previous examples of using LBFGS for training [23, 24, 21, 25], we have made two changes to the setup. First, we have replaced the Rectified linear unit (ReLU) activation with exponential linear unit (ELU) [35] activation to get non-zero second order derivatives. Our second modification is related to improving the computational cost. Automatic differentiation is used in PyTorch for calculating the gradient during training. However, this gradient is not needed every time the cost function is calculated in algorithm 1. Note that the cost function needs to be calculated many times for the line search (lines 15-17 in algorithm 1). This makes the computational cost of LBFGS much larger than SGD or Adam. However, we can improve this by disabling gradient calculation during the line search operation. In order to do this, we have to modify the definition of the loss function (closure()) as shown in Fig. 2.

In Fig. 3, we show the training error for the proposed stochastic LBFGS ( $M=10$ ), with ReLU and ELU activation, compared with SGD and Adam (with ReLU activation). We see similar performance with both ReLU and ELU activation for SGD and Adam and we show only one result. The mini-batch size used is $32$ and the training error is shown for every $200$ batches. Per mini-batch, we use $J=4$ iterations in algorithm 1.

Note that the training error is shown against CPU time used in training. Both SGD and Adam runs about $\times 5$ faster and therefore can process $\times 5$ more epochs than LBFGS for the same amount of total CPU time. The verification accuracy for $10000$ images is 63% for LBFGS with ELU activation as well as for SGD and Adam but LBFGS with ReLU activation has a lower verification accuracy of 61%. We clearly see the improvement of LBFGS due to ELU activation in Fig. 3 and shows the potential of LBFGS for further applications in deep learning.

5 Conclusions

We have proposed a stochastic LBFGS algorithm that can handle the calibration of very large volumes of radio interferometric data while using a limited amount of compute memory. We have demonstrated the performance of the proposed algorithm by comparing it with the full batch LBFGS algorithm. In addition, we have tested its use in deep learning, where we get comparable performance provided that activation functions are modified. We have implemented the proposed algorithm in PyTorch [32] and it is available for general use in other machine learning applications [36]. Future work will focus on adopting the proposed algorithm for big-data frameworks as in [37] for large scale distributed calibration of radio interferometric data especially for the detection of FRBs.

Bibliography37

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] A.J. Boonstra and A.J. van der Veen, “Gain calibration methods for radio telescope arrays,” IEEE Trans. Sig. Proc. , vol. 51, no. 1, pp. 25–38, Jan. 2003.
2[2] S. van der Tol, B. Jeffs, and A.J. van der Veen, “Self calibration for the LOFAR radio astronomical array,” IEEE Trans. Sig. Proc. , vol. 55, no. 9, pp. 4497–4510, Sept. 2007.
3[3] S.J. Wijnholds and A.J. van der Veen, “Multisource self-calibration for sensor arrays,” IEEE Trans. Sig. Proc. , vol. 57, no. 9, pp. 3512–3532, May 2009.
4[4] S. Kazemi, S. Yatawatta, S. Zaroubi, P. Labropoluos, A.G. de Bruyn, L. Koopmans, and J. Noordam, “Radio interferometric calibration using the SAGE algorithm,” Monthly Notices of the Royal Astronomical Society , vol. 414, no. 2, pp. 1656–1666, June 2011.
5[5] S. Kazemi and S. Yatawatta, “Robust radio interferometric calibration using the t-distribution,” Monthly Notices of the Royal Astronomical Society , vol. 435, pp. 597–605, Oct. 2013.
6[6] S. Kazemi, S. Yatawatta, and S. Zaroubi, “Clustered calibration: an improvement to radio interferometric direction-dependent self-calibration,” Monthly Notices of the Royal Astronomical Society , vol. 430, no. 2, pp. 1457–1472, 2013.
7[7] S. Yatawatta, “Radio interferometric calibration using a Riemannian manifold,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on , 2013, pp. 3866–3870.
8[8] C. Tasse, “Nonlinear Kalman filters for calibration in radio interferometry,” Astronomy and Astrophysics , vol. 566, pp. A 127, June 2014.