On phase retrieval via matrix completion and the estimation of low rank PSD matrices
Marcus Carlsson, Daniele Gerosa

TL;DR
This paper introduces a new non-convex optimization algorithm for estimating low-rank PSD matrices from underdetermined measurements, with applications to phase retrieval and analysis of oversampling effects on stability.
Contribution
It proposes a novel algorithm for low-rank PSD matrix recovery and provides theoretical insights into the stability of phase retrieval under oversampling.
Findings
The algorithm effectively estimates low-rank PSD matrices from limited data.
Oversampling improves the stability of the phase retrieval problem.
Theoretical analysis links oversampling to enhanced robustness in matrix recovery.
Abstract
Given underdetermined measurements of a Positive Semi-Definite (PSD) matrix of known low rank , we present a new algorithm to estimate based on recent advances in non-convex optimization schemes. We apply this in particular to the phase retrieval problem for Fourier data, which can be formulated as a rank 1 PSD matrix recovery problem. Moreover, we provide theory for how oversampling affects the stability of the lifted inverse problem.
| Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | |
|---|---|---|---|---|---|
| 0.9402 | 0.9419 | 0.9360 | 0.9461 | 0.9635 | |
| 0.0489 | 0.0579 | -0.0487 | 0.0341 | 0.0439 | |
| -0.0263 | -0.0327 | 0.0339 | 0.0318 | 0.0369 | |
| 0.0260 | -0.0212 | 0.0227 | -0.0298 | 0.0274 | |
| 0.0173 | 0.0158 | -0.0172 | -0.0177 | 0.0155 | |
| -0.0119 | 0.0104 | 0.0100 | 0.0090 | -0.0118 | |
| 0.0051 | 0.0046 | 0.0073 | 0.0061 | 0.0079 | |
| 0 | 0.0003 | 0.0022 | 0.0050 | -0.0021 | |
| 0 | 0 | 0 | 0.0031 | 0 | |
| 0 | 0 | 0 | 0 | 0 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
On phase retrieval via matrix completion and the estimation of low rank PSD matrices
Marcus Carlsson Lund University, Center for Mathematical Sciences.
[email protected], [email protected]
Daniele Gerosa11footnotemark: 1
Abstract
Given underdetermined measurements of a Positive Semi-Definite (PSD) matrix of known low rank , we present a new algorithm to estimate based on recent advances in non-convex optimization schemes. We apply this in particular to the phase retrieval problem for Fourier data, which can be formulated as a rank 1 PSD matrix recovery problem. Moreover, we provide theory for how oversampling affects the stability of the lifted inverse problem.
Keywords: Fourier phase retrieval, low rank matrices, non-convex optimization.
MSC2010: 49M20, 49N45, 65K10, 90C26.
1 Introduction
The (discretized) formulation of the phase retrieval problem consists in finding a complex vector , usually a discretized signal or image, given a certain amount of measurements in form of modulus of scalar products (see [SEC15]), i.e.
[TABLE]
(where are known), and possibly additional geometric constraints. The aim is thus to reconstruct the discrete vector representing the object. In [BCE06] is shown that generic complex measurements are needed in order to be able to distinguish two different signals (up to a phase); in [CEH15] the lower bound was improved to . This number is believed to be close to optimal.
A possible combinatorial approach to the problem has been described in [SCa91], together with the proof that, as formulated, the problem is NP-hard. An optimization-based standard solution technique would be to solve the least-squares minimization problem
[TABLE]
but the quadratic terms in the objective function makes the problem highly non-convex and therefore hard to solve. A convex optimization approach was suggested in [CSV13], called PhaseLift, and this note considers an improvement of this approach. We recommend [FWd15] for a pleasant overview of recent advances regarding PhaseLift and related techniques.
A lot of work has been devoted to deal with cases in which are Gaussian measurements, see e.g. [CLS15], but these algorithms seems to work poorly with Fourier data. The main objective of the present article is to provide improvements of the PhaseLift approach, with a particular focus on Fourier measurements, based on recent advances in non-convex low rank approximation algorithms. The theory relies on the quadratic envelope applied to the indicator functional of matrices with a fixed predetermined rank, and can be used also for any other PSD-estimation problem of fixed low rank (Section 2.1).
1.1 Physical background
The phase retrieval problem appears frequently for instance in elecrodynamics; the single scalar complex field , often called wave-function, solution of the d’Alembert equation
[TABLE]
is enough to describe the electromagnetic disturbance in the free space. Making use of Fourier integral, the wave-function can be spectrally decomposed as superposition of monochromatic fields
[TABLE]
where the spatial wave-function associated with a given monochromatic component of the spectral decomposition of solves the Helmholz equation
[TABLE]
with suitable boundary conditions. In Coherent Diffractive Imaging (CDI) an object (sample) is illuminated by a coherent almost monochromatic wavefield and subsequentially the diffracted far-field intensity pattern is measured by detectors. The current detectors measure intensities but they are not able to measure the phase of the diffracted pattern; the phase retrieval problem consists then in retrieving the object knowing the aforementioned intensity plus, in general, some additional physical constraints. In the Fraunhofer regime, which occurs whenever the Fresnel number ( being the diameter of the region occupied by the sample and the distance between the sample and the detector) it can be shown (see for instance [Pag06]) that the -dimensional propagated disturbance measured at (where the optical axis is considered to be coincident with the axis) is
[TABLE]
where is the wavenumber and the -dimensional Fourier transform. In words this says that "the propagated disturbance has the form of a paraxial modulated spherical wave with the modulation being proportional to (a transversely scaled form of) the two-dimensional Fourier transform of the unpropagated disturbance" ([Pag06]).
A family of algorithms developed for solving the phase retrieval problem in combination with a priori knowledge of the support of the object, relies on the simple idea of alternatingly adjust the support of the image and the modulus of its Fourier transform. This goes back to Gerchberg and Saxton ([GSa72]) and to Fienup ([Fie82]), even though the pool of mathematical ideas, borrowed from convex analysis, goes back at least to Bregman ([Bre65]). The most famous is the error-reduction algorithm which is essentially a non-convex adaptation of the idea of Projection On Convex Sets (cfr. POCS algorithm, chatper 5 of [BCo10]): if is the object we want to reconstruct, supported in , error-reduction performs successive projections between the two sets
[TABLE]
The drawback is that the first set is not convex and therefore there are no theoretical guarantees of the convergence of this scheme. Further algorithms have of course been developed; in [Mar07] the projections are replaced with reflections, in [MBK16] a Newton-type scheme with Tikhonov regularization is proposed, to mention two recent contributions. These methods have been successfully applied with real data, but the non-convexity issue persists.
1.2 PhaseLift approach
In the present note we focus our attention to a lifting approach popularized by [CSV13] and [CES13]. For , where ∗ is the conjugate transpose and the latter product is the usual matrix multiplication, we define a linear operator via
[TABLE]
being the Frobenius inner product and the space of Hermitian matrices. Noticing that and recalling that every positive semidefinite matrix with admits a factorization of the type with , the quadratic phase retrieval problem is lifted onto a linear one with dimension squared; the problem (1) to be solved is now to find a suitable rank PSD-matrix such that
[TABLE]
In real applications, the “perfect measurement” is contaminated by noise, so the problem can be re-casted as an optimization one
[TABLE]
The constraint is however highly non-convex, so a convex relaxation has been proposed in [CSV13] based on the nuclear norm, known for being “low-rank inducing” (cfr. [RFP10]):
[TABLE]
The parameter gives a tradeoff between low rank of the sought minimum and data fit, and this term also induces a bias which scales with . For this reason, in practice one does not in general find a rank 1 matrix, but a low rank one from which a suitable vector candidate can be extracted, we refer to [CSV13] for the details.
1.3 Innovations and discussion of the lifting trick
While recognizing the PhaseLift as an important and celebrated contribution, we have found that this method suffers from a number of drawbacks making its practical use limited. The main issue is the size of the involved objects due to the lifting process. A typical (CDI) measurement generates an image of at least which leads to a vector of size , which in turn yields a -matrix which is then fed to an iterative algorithm which need to compute eigenvalue factorization. The time complexity of the latter, without any implementation tricks, is .
The second drawback stems from the fact that one usually need to run the algorithm several times in order to find a suitable value of , and a third drawback is that one typically need lots of linearly independent measurements for stable recovery. As mentioned initially, one needs at least measurements for having an (essentially) unique solution in the noiseless case. [CSV13] suggests working with where is some (unknown) constant, whereas in their numerical section they use where . When dealing with Fourier data, one commonly used method for boosting is simply oversampling in the Fourier domain [MSC98], but it was observed numerically in [CES13] that this gives an ill-posed inverse problem (see Sections 2.1 and 4.4.3), which led them to instead suggest the use of e.g. masks to increase the amount of measurements.
In this article we will prove that oversampling indeed does not give rise to more linearly independent data beyond , where is the dimension of the problem. In other words it is pointless to oversample more than a factor 2 in each dimension. This also shows that any method based on the lifting principle (3) is bound to fail unless combined with masks or similar tricks, at least for . Furthermore, we will rewrite (4) into an optimization problem where the constrains are built into the functional to be minimized, and then use the quadratic envelope to regularize this functional. Ideally one would like to work with the convex envelope of the functional, but this is not doable in practice. The quadratic envelope yields a non-convex continuous surrogate which has recently been shown to have the desirable property of yielding a regularizer which, just like the convex envelope, does not move the global minima [Car19]. While not completely removing other local minimizers, it reduces the amount and in practice it seems that the corresponding algorithm can find the global minimizer in fairly difficult problems. In other words we suggest a new framework which seems to be able to solve the original problem (4). We refrain from proving this claim mathematically, but remark that in [CGO19] an analogous statement is rigorously shown for the similar problem of low rank recovery without PSD constraints. Here we satisfy with developing the algorithm, with a particular focus on efficient implementation, as well as numerically demonstrating several desirable features (see Section 5);
- •
the algorithm can solve any problem of the type (4) but with a predetermined rank , not necessarily .
- •
The nuclear norm approach (5) yields matrices with rank larger than 1, hence using more degrees of freedom. Despite this, the proposed method gives a better data fit while at the same time fulfilling the rank constraint.
- •
In [CES13] a reweighted version of (5) is used, which is known for better accuracy, (but this algorithm is non-convex as well). In terms of reconstruction accuracy the two algorithms now perform similarly. Benefits of the proposed algorithm is that it finds the correct rank and does so within fewer iterations, plus that it relies on a more developed theoretical framework.
The paper is structured as follows; Section 2 presents the mathematical details behind the new general fixed rank PSD-matrix estimation algorithm, Section 3 presents our theoretical findings regarding the application to phase retrieval and sampling issues, Section 4 covers implementation details and finally Section 5 concludes with some numerical illustrations.
1.4 Notation
will denote the set of complex matrices and the subset of Hermitian (i.e. self-adjoint) matrices. is the size of the unknown vector before lifting to a problem with unknowns, and is the amount of measurements, which we assume is larger than . Images and higher dimensional objects are represented as tensors and the measurements then take place in the Fourier domain which we discretize as , where . Hence and is a constant multiple of , where is the number of masks. denotes the operator (3) and its “matrixification” (as described in Section 4.1) is denoted by A, except in the case of pure Fourier data, in which case we denote these objects by and F respectively. The quadratic envelope is denoted by . denotes composition of two functions, in particular if acts on then acts on , where denotes the eigenvalues of any given ordered non-increasingly.
2 Quadratic envelope approach to low rank recovery
Let be a Hilbert space, a non-convex penalty functional and set
[TABLE]
where is a linear operator from to some other Hilbert space in which the “measurement” lives. In [Car19] the quadratic envelope
[TABLE]
is studied as a means of regularizing functionals of this type. It is shown that the regularized functional
[TABLE]
has some desirable properties when . Most notably, lies between and its lower semi-continuous convex envelope, any local minimizer of is a local minimizer of and the global minimizers of and coincide.
In [CGO18] the quadratic envelope has been studied further for the case of the famous problem in which and or, in case the sought degree of sparsity is known,
[TABLE]
Here denotes the number of non-zero entries in . In these cases has been numerically compared to the more classical -penalty in solving the problem of retrieving a sparse signal. In the companion work [CGO19] the analysis has been lifted to the space of -matrices , and or
[TABLE]
The vector problem and the matrix problem are closely related; note that where denotes the singular values of . It turns out that
[TABLE]
and, more importantly, if has SVD , that . In other words, if we can compute the proximal operator in the scalar case, the matrix case follows immediately. In [CGO19] the theory is much further developed for the case ; in particular we show that for noisy measurements the functional has a unique local minima which is also the solution of . The new ingredient in the present contribution is basically the PSD condition, and we satisfy with numerically testing the machinery on the Phase Retrieval problem as well as some generic low rank recovery problems with .
2.1 Low-rank Positive Semi-Definite problems
We now come to the new material for this paper. In this section we set to be the set of Hermitian matrices and consider the problem of searching for a PSD matrix with fixed rank . Of course, for the PhaseLift problem we will set , but we develop the theory for general . Motivated by the previous section we introduce the function defined by
[TABLE]
where is interpreted elementwise. If denotes the set of positive semi-definite matrices with rank , then it is easy to see that where denotes the eigenvalues of . Given a transform we then have that the fixed-rank minimization problem
[TABLE]
(where denotes the subset of PSD matrices) is equivalent with
[TABLE]
This in turn has the same global minimizer as the regularized problem
[TABLE]
as long as , as argued earlier. The latter is a continuous (in its domain , cfr. Proposition 3.2 of [Car19]) functional whose stationary points can be found for instance via Forward-Backward Splitting scheme (also know as Proximal Gradient Method: see for instance [Bec17])111This statement has a theoretical foundation: for a lower semicontinuous and semialgebric , the function is lower semicontinuous and semialgebric ([Car19]), and therefore by a theorem due to Bolte-Daniilidis-Lewis ([BDL07]) it has the local Lojasiewicz-Kurdyka property. In [ABS13] it is showed that the sequence generated by FBS, a particular case of their Algorithm 3, converges to a stationary point of the functional, or else is unbounded. , as long as the proximal operator of is computable. We now describe how this can be done. We recall that a function is said to be symmetric if for every permutation and for every . We also recall that for a lower semi-continuous function , being an Hilbert space, the (possibly empty or set-valued) proximal operator is defined by
[TABLE]
Proposition 2.1**.**
Let be a symmetric function and consider defined by . Then and, for , we have
[TABLE]
where is a spectral decomposition of .
Proof.
By Proposition 3.1 in [Car19], we have , where . Note that
[TABLE]
and that , by a variation of the von Neumann trace inequality (cfr. [Lew96]). Moreover the equality holds if and only if there exists a unitary matrix such that and . We may therefore assume without loss of generality that for some , being a spectral decomposition of . For such we have that , (where we do not require that is non-decreasing). By the symmetry of we have that , and therefore
[TABLE]
The corresponding claim for follows immediately by .
Let us prove the second part of the statement. We have that if and only if minimizes the functional which is convex, since is the l.s.c. convex envelope of (by Theorem 3.1 [Car19]) and . This in turn happens if and only if . From Section 5.2 of [BoL05] follows that the latter holds if and only if and share the same eigenvectors and
[TABLE]
This is equivalent to and by the symmetry of we have that whenever . Hence any unitary matrix of eigenvectors to are also eigenvectors to , and since , the desired conclusion follows. ∎
We remark that the maximum negative quadrature of is , as shown in [Car19], so the condition ensures that the proximal operator is single valued (since the corresponding minimization problem is strongly convex).
Of course we are interested in the concrete case of , but it turns out that has a rather complicated expression. Luckily, the related transform has a simpler expression, and it follows that we can still compute the by duality; we postpone the details of this to Section 4.2. In the coming section we investigate the structure of the operator for the phase retrieval problem with multidimensional data and Fourier measurements.
3 Images and Fourier data
Let us return to the phase retrieval problem. We consider “images” in dimensions, which can be realized as the tensor product vector space , where we use the same number of points in every dimension for simplicity only. This linear vector space can of course be identified with by introducing some basis, but we will see that it is often convenient to actually skip this step and work directly in the more abstract setting. In this case a rank 1 matrix corresponds to a linear operator on of the form , and the matrix is PSD if and only if the operator can be written , where the bar indicates complex conjugation.
Given elements , where runs over some set of (usually multidimensional) subindices (where ), we seek one element such that
[TABLE]
In this new framework, PhaseLift corresponds to the equivalent problem of finding a rank 1 PSD linear operator on such that
[TABLE]
3.1 Fourier data and limitations to oversampling
In the common case of Fourier measurements, the ’s are discretizations of pure oscillatory exponential functions, and in this case we denote them by and the corresponding operator by in place of . We denote by the linear vector space of all functions on , so that naturally identifies with an element of . Often, we measure on an grid where is the oversampling factor, i.e. in which case is readily identified with . To be more explicit, in this case we have
[TABLE]
for and . However, to allow for unequally spaced sampling we stick to the more general setting where
[TABLE]
and are some frequencies and some set. The equation can now be written
[TABLE]
where is the linear operator defined by .
Since we are working with an ill-posed inverse problem, it is crucial to get as much linearly independent equations as possible. In other words we want to choose so that has maximal rank. The next result basically states that it is pointless to oversample beyond a factor of two.
Lemma 3.1**.**
Independent of how is chosen, the maximal rank of is .
Proof.
We let denote the canonical basis in . The range of is then spanned by the functions which, if we write with , takes the form
[TABLE]
The amount of tuples of the form equals , which gives the desired upper bound. ∎
We now prove that, given sufficient oversampling and a vise choice of , the rank of actually equals the maximal rank . More precisely, we will need that is such that is a linearly independent set on or, if , that these functions span the full space . Such a choice of will be called non-degenerate. By considering the DFT it is clear that non-degenerate sets of frequencies exist for all cardinalities of . More generally, say that we pick our ’s on an irregular product set, i.e. suppose that for each there are “coordinate-frequencies” and the multi-frequencies are formed as where . Then, in order for the multidimensional set to be non-degenerate it suffices for each coordinate set is non-degenerate (see e.g. Theorem 2 in [Hay82] or Proposition 4.1 in [AnC17]).
Proposition 3.2**.**
Given a non-degenerate set of frequencies , the rank of equals
[TABLE]
Proof.
First assume that and denote in this case by . As in the previous proof we have that the range of is spanned by where . In this case, there are as many different ’s as there are ’s, and it follows by basic linear algebra that the set of functions of the form is linearly independent if and only if the set of functions is, which is true by assumption. This finishes the proof under the assumption that .
Given any particular basis for the domain and the codomain, the operator has a matrix representation (of dimension ). We know from what we have already shown that we have linearly independent columns, i.e. has full rank. If we can think of as adding rows to the matrix , which then impossibly can result in a lower rank. Combined with the previous lemma, this shows that the rank in this case is . Similarly, if we can think of as removing rows from the full rank matrix , which then clearly results in a new full rank matrix. Hence the rank will be , and the proof is complete. ∎
The above proposition is somehow bad news for the "oversampling approach", since it shows that the maximum oversampling factor one could hope for without adding more linearly dependent equations is roughly . This conclusion is also backed by the numerical experiments in [CES13]: Section 4.4.3 is devoted to numerically demonstrate that oversampling alone does not yield a well-posed inverse problem. Hence Proposition 2.1 can be seen as a theoretical justification of numerical observations in [CES13]. We also validate this conclusion experimentally in Section 5.2. An interesting remark is that two is also a suitable oversampling parameter for other Phase Retrieval methods, see e.g. [MSC98] which backs up this conclusion experimentally. This can be understood since is a trigonometric polynomial with undetermined coefficients, and it is well known that one needs precisely (non-degenerate) measurements to uniquely determine these coefficients. The following interesting papers [BeP15, BBE17, BrS79, Hay82, HES16, Osh12] contain more information about uniqueness of the Phase Retrieval problem, (as well as other interesting algorithms and plenty of other information). A curious remark is that the factor does get better with the dimension , and it is also well known in the experimental community that 3d reconstructions are more stable; see e.g. [CBM06, CZL14]. For the moment, the case is too computationally expensive with the present method, but we hope to address this shortcoming in future work. The proposition also shows that one does not increase stability, or degree of linear independence to be more precise, by picking from some irregularly sampled grid. Hence we find no motivation to deviate from the simplest possible choice, i.e. picking an in the range and use (10) or, which is the same, setting and in (11).
3.2 Stabilizing PhaseLift by adding new equations
Section 2.1 of [CES13] provide a list of experimental methods which can be employed in order to add further linearly independent measurements to those given by the operator from the previous section. In particular it is argued that one can use masks, that is, you create new measurement vectors by introducing a mask which only lets light through in a region . Mathematically, this amounts to multiplying the image with the characteristic function . If we use regularly sampled measurements as in (10), this gives new measurement-functions via the formula where is realized as a subset of , and runs over the index set . Also ptychography, optical grating and oblique illuminations are considered. However additional linearly independent measurements are created, we can form an operator by extending so that the problems (6)-(8) has an essentially unique solution.
While the methods mentioned above are great from a mathematical perspective, they are often not feasible or slow or expensive from a physical perspective in a concrete experiment. Due to this, a priori estimates on the support of the object measured is still by far the most common method used in practice. This was pioneered by Fienup in [Fie82] and presently the most popular methods to solve the phase retrieval problem in this context are Hybrid Input Output ([Fie82]), Difference Map ([Els03]) or Relaxed Averaged Alternating Reflection ([Luk05]). A simple way to incorporate support constraints into the scheme (6)-(8) is to construct by extending by simply adding linear equations for all pairs such that either or is outside of (recall that is a tensor in ). However, this will yield an algorithm which balances meeting the support constraint versus the low rank inducing functional , and hence the output may still be non-zero off , which may be suitable depending on how certain the support estimate is. An alternative is to set (i.e. only “pure” Fourier data) and use ADMM on the problem
[TABLE]
which will force the solution to obey the support constraint. We leave it for further research to investigate these options from a practical perspective.
3.3 Estimating for masked Fourier data
As explained in section 2, the parameter in the quadratic envelope needs to be chosen considering the size of , and since in practice is a huge matrix it is not desirable to have to compute its norm. We therefore provide some rough estimates here depending only on the dimension of the images, in the simplest case when . We thus assume that is formed as explained in the first paragraph of the previous section, using number of masks (plus pure Fourier data). Let , be the suboperator of connected to a particular mask (so are those measurement where no mask is used.) In other words . The tensors are mutually orthogonal (since the ’s are). Moreover
[TABLE]
so it follows that is times a unitary operator, which has operator norm 1. Now, if we represent as a matrix , then it is clear that (for each ), is obtained from by replacing entire rows and columns by 0. This means that by basic estimates, and hence the triangle inequality implies that . Summing up we have shown that
[TABLE]
4 Implementation aspects
In this section we show how to efficiently minimize (8) without additional constraints, (or at least how to compute a stationary point). Since (8) is of the type where is non-convex but is smooth, we use the Forward-Backward Splitting scheme, as this has been established to converge to a stationary point under assumptions which are applicable in our setting. Moreover we suggest to use the accelerated version FISTA, since we have observed that this significantly speeds up convergence. The algorithm alternates between a gradient update step and a proximal operator step. The computation of the gradient of is mathematically straightforward but very time consuming, and we will therefore introduce certain tricks based on FFT and Toeplitz matrices for efficient evaluation of this step when working with Fourier data, see Section 4.1. The computation of the proximal operator of is rather tricky, the details are given in Section 4.2. We give here a general overview of all the functions involved. For concreteness we present the code when working with Fourier data and a number of masks, and leave it up to the reader to work out details for e.g. ADMM routines with support constraints.
In FISTA, short for Fast Iterative Shrinkage-Thresholding Algorithm, the proximal-gradient steps are taken at the interpolation
[TABLE]
where is a sequence of positive real numbers with some properties that ensure convergence in the convex case (see for instance [Bec17]). We used the sequence , for as suggested in Remark 10.35 of [Bec17]. Both and are initialized at zero. Upon choosing a step-size (which we discuss a little further on), the FISTA algorithm now alternates between the update steps
Compute .
- 2.
, where grad is the gradient of , evaluated at .
- 3.
.
For the latter to be single valued, we need to have , which puts an upper bound to the choice of step-size .
We now discuss suitable choices of and . Based on (15) and the theory for the quadratic envelope, it would seem that would be a natural choice, since it would guarantee that (6) and (8) have the same global minimizer. However, a large also means that is, informally speaking, more non-convex, and we have found that the choice gives a better performance. We have also observed that performance is rather stable with respect to changes in around this value. With this choice, we get the bound for the step-size. However, the convergence of FISTA is guaranteed (in both convex and non-convex cases) if the stepsize is (see [BCo10] and [ABS13]), where is the differentiable function and the Lipschitz constant of its gradient, which in our case leads to . Based on (15), we therefore used in our experiments. In our experience, larger step-size leads to faster convergence, but it is important to not exceed the bound.
4.1 Efficient computation of the gradient step
We shall here only treat Fourier measurements with masks, in which case the gradient can be computed very efficiently by using the Fast Fourier Transform (FFT in short), but the details are a bit intricate. A technical problem is the representation of tensors as vectors in a computer. Given any enumeration of the index set and a vector , we denote the corresponding vector by . More precisely, we let be a bijection and set . If for example (so we are dealing with images) and we vectorize by column-stacking, then
All vectors are realized as column-vectors, so that e.g. becomes a matrix. Similarly, if denotes the amount of elements in the function can be identified with a vector by ordering the elements of . Once both the domain and codomain have been vectorized, the operators and get matrix representations that we denote by A and F. If relates to (and ) as described shortly before (3), it is now easy to see that
[TABLE]
by basic multivariable calculus.
We first consider the case of no masks and follow the oversampling recommendations from Section 3.1; we pick a parameter with and set . This set is then in bijective correspondence with through , and we will implicitly identify in the latter with in the former. Now and derive from pure oscillatory exponentials as in (10). More precisely corresponds to as in (10), where is the tensor counterpart of .
Similarly, since represents an arbitrary matrix in it implicitly defines a tensor on the tensor product of with itself, by the formula . Summing up we have that equals in the space \big{(}\otimes_{j=1}^{d}{\mathbb{C}}^{n}\big{)}\otimes\big{(}\otimes_{j=1}^{d}{\mathbb{C}}^{n}\big{)} or, more explicitly
[TABLE]
where the sums over run over all , runs through and each coordinate of , say , satisfies to ensure that both and are in . We now introduce (where summation bounds for are as before) for , and define outside of this grid. In the case the operation above corresponds to summing over lines in parallel to the diagonal, as in a Toeplitz matrix. For the multivariable case the above is easily computed using , by simply noting that .
Clearly (17) is a sort of inverse Fourier transform on , and moreover the computation of is fast; . To make use of FFT for fast evaluation of this expression jointly for all , we introduce for all . Since usually , note that for each the sum contains at most 2 non-zero entries, so this operation is efficiently evaluated. Summing up we have that
[TABLE]
where denotes the discrete Fourier transform on the grid . This can be efficiently evaluated (i.e. in most programming languages using some preset implementation of the FFT. For example, when and when using MATLAB, we can represent as a multidimensional array and evaluate (18) using the command fftn.
Let us now return to the formula (16). We denote the element computed in (18) by , and similarly we introduce . Then (16) takes the form
[TABLE]
where is the vector representation of the tensor via , as before. At some fixed index pair , this is evaluated as
[TABLE]
where and . Clearly all these values reduce to (at most) different values, and can be computed by an inverse Fourier transform on the tensor , which has time complexity .
This concludes the explanation of how to efficiently compute the gradient step (16) in the case of no masks. Luckily, the incorporation of masks poses very little additional difficulty. Say we have masks as described in Section 3.2. Clearly then the computation of (16) can be separated in independent pieces, one for each mask. We just describe how to compute one of these contributions for a fixed mask . In this case we again have that and we may as well index the vectors using instead. Now corresponds to the tensor and to its vectorization via . If is the vectorization of via and denotes the diagonal operator on with as diagonal, we thus have that with as before. Returning to the coefficients in (16) we have that
[TABLE]
which means that we can use formula (18) upon replacing with Finally, since the summation in (16) can be evaluated as in (19) with the difference that we insert a 0 in the final matrix on positions where either of the two coordinates and lie outside of .
4.2 Computation of the proximal operator
The expression of the proximal operator of is quite involved. We first recall that where is computed by first taking the Legendre transform of and then subtracting . By an alteration of the Moreau decomposition (cfr. Chapter 14 of [BCo10]) it is possible to show (cfr. [Car16], Proposition 3.3) that if we have
[TABLE]
where (see again [Car16])
[TABLE]
being the decreasing rearrangement of . We now sketch the routine for computing . After some basic algebra it turns out that
[TABLE]
By the rearrangement inequality we may assume that is already ordered non increasingly and therefore the latter turns into
[TABLE]
For a given let be the minimum between and the last index for which is non-negative. Then (21) becomes
[TABLE]
Now it is clear that if the vector
[TABLE]
is non-increasing then it is a global minimizer for (21). In particular this happens whenever switches sign before (i.e. ) or whenever . We exclude these two scenarios from the further analysis.
If then and , but looking at (22) we see that setting would diminish the quantity (since ), a contradiction. Thus we have that , and so (22) becomes
[TABLE]
Let be some candidate for the global minimum and set . It is easy to see that must have the following structure:
[TABLE]
Now consider (25) inserted into (24); except from an additive constant this function takes the form
[TABLE]
By inspection it is clear that this function is convex and has a unique minimum on the interval ; indeed the first sum is constant on and strictly convex (increasing) on , whereas the second sum is strictly convex (decreasing) and constant on . We seek for the unique solution of in . Let be the smallest index such that and let be the biggest index such that ; now the set provides a partition of . Let be one of the subintervals. For all values of in let be the first index such that and let be the last, where is given by (25). The formula for the solution of becomes
[TABLE]
If this value is outside then the optimal is to be found in another interval. By strict convexity will hold in at least one interval, and at most two intervals (when lies on the boundary).
In conclusion we can summarize the previous observations in an algorithm for computing the proximal operator:
If or return (23), else
- 2.
compute and ;
- 3.
sort decreasingly and call it ;
- 4.
for set and compute the indices and as described above. Notice that the indices and are the same for all , and this is why it is enough to consider the midpoints;
- 5.
compute the new value of according to (26);
- 6.
if stop and return (25). Otherwise increase with and repeat the steps 4-5.
The “for-loop” 4-6 must stop since is strictly convex and has a unique minimizer.
5 Numerical examples
We will illustrate the various results by conducting four experiments. In the first experiment we will compare the proposed method with PhaseLift and reweighted PhaseLift, in the second we test the oversampling conjectures from Section 3, in the third we discuss the capacity of the proposed method of solving low rank PSD problems of any given rank as predicted in Section 2, and finally we end with the reconstruction of a 2D image using the tricks of Section 4.
5.1 1D synthetic "masked" Fourier measurements
In this subsection we consider 1-dimensional signals with randomly generated with a Gaussian distribution; they represent our "ground truth", i.e. the signals that we are interested in retrieving. The method we proposed is compared to the reweighted nuclear norm technique, following the numerical section of [CES13]. The idea of reweighting the nuclear norm was heuristically introduced by [FHB01]. Despite of the lack of a systematic mathematical theory behind, this trick seemed to work quite well for our problem too, proving to be able to significantly enhance the capacity of the nuclear norm of finding low rank matrices.
We consider binary masks, as introduced in Section 3.2. Following [CES13] we use (oversampling is considered separately in the next section) and three masks, so and . Note that this means that we use 400 equations plus a rank-PSD constraint to determine roughly 5000 variables in the lifted problem. We focus our attention in comparing the approximated solutions to the two problems (8) and
[TABLE]
where are weight (diagonal) matrices and where is as described in section 3.2 and with additive random gaussian noise.
The proposed method only contains one user parameter which is fixed as described in Section 4, independent of noise level. The minimization of (27) on the other hand relies on which must scale with the noise for optimal performance, and moreover the update of the weights requires yet another parameter to be chosen; following [CWB08] we set
[TABLE]
being the outcome of the algorithm after some fixed number of iterations with . is just the identity matrix.
We have not found any clear guidance in the literature for how to pick and ; in [CES13] the algorithm is run several times for different ’s picked according to a golden section search or cross validation. This is very time consuming, especially taking into account that the algorithm is slow to begin with due to the lifting trick. We have found that and seemed to work fairly well in our experiments, and these are the values used in the graphs presented. The bisection search only gives a marginal improvement to the results shown, but it would be unfair to do this in comparison with 8, since this is run only once with one fixed choice of . For practical purposes, the fact that (8) does not rely on intricate parameter choices is clearly a strength.
To compute the numerical approximations we used the FISTA algorithm (already previously described) with iterations for (8) and with iterations for each weights update for (27); we found out that does not give any significant contribution to reconstruction accuracy. The stepsize was fixed to (see section 4) and 5 different trials for each level of noise were carried out, in the sense that for each we run five different experiments, i.e. we randomly generated five different ground truths (and consequently five different noise vectors with ), and we averaged the various outcomes over the number of trials. All the masks were randomly generated and the norm of the measured data vector is averagely ; in the graphs below we cover cases where the noise has up to of the signal magnitude. According to [CES13] the approximated signal reconstructed using (27) is chosen as the eigenvector of the greatest eigenvalue of the output of the algorithm. Since in (8) only rank 1 matrices are involved, the approximated signal will be the only nonzero eigenvector of . The underlying signal is unique only up to a global phase, therefore the distance between and is computed as
[TABLE]
In Figure 3 we plot , and in Figure 4 the average rank of . The graphs show that the performances of the three methods. It is clear that nuclear norm without reweighting is inferior in all aspects. We focus therefore on discussing the proposed method versus reweighted nuclear norm. In terms of “norms precision”, these are essentially comparable; nevertheless the reweighted nuclear norm fails in finding the correct rank for high levels of noise. Thus, when this method is equivalent with the proposed method, it uses more degrees of freedom. On the other hand the method that we propose is always able to retrieve rank 1 solutions, in perfect compliance with the theory developed.
Since determining the numerical rank is somewhat fishy, we include the following tables which show the first ten eigenvalues (decreasingly ordered) of the approximation , computed respectively with (8) and (27), in each of the five different experiments run for the noise level . As is plain to see, the solutions obtained by the reweighted nuclear norm are far from rank 1.
5.2 Stabilizing by oversampling
In broad terms, the conclusion of Section 3 is that it is desirable to oversample a factor 2 (in each dimension), but beyond this point no further gain is expected. We test this for the one dimensional case on synthetic data of size with an identical setup to the one above. As measurement we use where is noise which we vary between 0 and of the magnitude of . Each data point in the below graphs is the average of five distinct trials.
A numerical observation in [CES13] is that one needs 3 masks to have stable inversion. The authors also point out that a low residual error in combination with a poor actual error is an indicator of having an ill-posed inverse problem, i.e. one where several different points minimize the functional in question. Below we plot both graphs as a function of the Noise to Signal Ratio .
To the left we see that the residual errors are almost identical, and moreover close to the line , as expected, since it is clearly unlikely to beat the noise level by any notable margin. Indeed, if we were to get then clearly , so the above graph actually tells us that we most of the time find a solution better than ground truth.
To the right we see the normalized reconstruction error, and we can confirm that the expectations from Section 3 is in perfect compliance with the outcome. Oversampling by a factor two has a significant effect in improving the actual error, whereas oversampling by a factor 3 only has a marginal effect, as predicted by Proposition 3.2. Noteworthy is that also the latter two curves are close to “”.
Encouraged by this result we now try the same setup with 2 masks. As is plain to see, the pattern repeats itself with the major difference that is completely unreliable except for 0 noise, whereas and do a similar job which, although not extremely convincing, clearly outperforms . In conclusion, it seems that the recommendation of oversampling with a factor of two has both theoretical and numerical support, at least in the case .
5.3 Finding the correct rank for general PSD estimation problems
We have ran extensive tests for minimizing (8) with different choices of , and ranks , and never found a single instance where the algorithm converges to a point with the wrong rank, i.e. different from , as long as .222In practice it may be beneficial to violate this restriction, as explained in Section 4 This is in accordance with the theory for quadratic envelopes as developed in [Car19] and summarized in Section 2, stating that local minimizers of (8) form a subset of those for (6), except for the fact that FBS is only guaranteed to find stationary points,333At least relying on the theory summarized in Section 4 not necessarily local minima. This indicates that either there are no saddle point type stationary points, or that FBS actually has the capacity to avoid them. We do not underline these observations with any specific graph, since it seems impossible to design one experiment that would cover all different types of potential applications.
5.4 2D image reconstruction
We complement this section by displaying a 2D reconstruction example. The example chosen is a pixels grayscale real image; the initial measurements were contaminated with of additive gaussian random noise. The algorithm used was again FISTA with eight binary masks and the parameters choice indicated in Section 3.3 where we employed the tricks of Section 4.1 for efficient evaluation via FFT of the gradient-step. The reconstruction was made on a standard laptop and the bottleneck timewise for the algorithm is the eigendecomposition, which prohibits larger images to be processed. We plan to address this shortcoming in future works, which clearly needs to be solved for the algorithm (as well as PhaseLift) to be of practical use for 2D or 3D imaging.
Our experience suggested that the 2D problem needs a higher number of masks (with respect to the 1D case) in order to be stabilized; this observation is present in [CES13] as well where, in the noisefree case, eight binary masks are used. We used for simplicity, it is of course possible that oversampling could lead to fewer masks, but we have not yet tested this.
On the side, the original image is the second while the first is our reconstruction.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[An C 17] F. Andersson and M. Carlsson, On the Structure of Positive Semi-Definite Finite Rank General Domain Hankel and Toeplitz Operators in Several Variables . Complex Analysis and Operator Theory, 11(4), pp. 755-784, 2017.
- 2[ABS 13] H. Attouch, J. Bolte and B. F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods . Mathematical Programming, 137(1-2), pp. 91-129, 2013.
- 3[BCE 06] R. Balan, P. Casazza and D. Edidin, On signal reconstruction without phase . Applied and Computational Harmonic Analysis, 20(3), pp. 345-356, 2006.
- 4[B Co 10] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces . Springer, CMS Books in Mathematics, 2010.
- 5[Bec 17] A. Beck, First-Order Methods in Optimization . MOS-SIAM Series in Optimization, 2017.
- 6[Be P 15] R. Beinert and G. Plonka, Ambiguities in one-dimensional phase retrieval of structured functions . Proceedings in Applied Mathematics and Mechanics, 15(1), pp. 653-654, 2015.
- 7[BBE 17] T. Bendory, R. Beinert and Y. C. Eldar, Fourier Phase Retrieval: Uniqueness and Algorithms . Preprint, ar Xiv:1705.09590 v 3 , 2017.
- 8[BDL 07] J. Bolte, A. Daniilidis and A. Lewis, The Lojasiewicz Inequality for Nonsmooth Subanalytic Functions with Applications to Subgradient Dynamical Systems . SIAM J. Optim., 17(4), pp.1205-1223, 2007.
