Second-best Beam-Alignment via Bayesian Multi-Armed Bandits
Muddassar Hussain, Nicolo Michelusi

TL;DR
This paper introduces a Bayesian multi-armed bandit approach for mm-wave beam alignment, optimizing the balance between exploration and exploitation to improve alignment probability and throughput.
Contribution
It proposes a novel second-best preference policy based on Bayesian bandits, outperforming existing methods in beam alignment tasks.
Findings
Up to 30% improvement in alignment probability.
Superior performance over Thompson sampling and UCB methods.
Effective in analog beamforming simulations.
Second-best Beam-Alignment via Bayesian Multi-Armed Bandits
Muddassar Hussain, Nicolò Michelusi
This research has been funded by NSF under grant CNS-1642982. The authors are with the School of Electrical and Computer Engineering, Purdue University. Email: {hussai13,michelus}@purdue.edu.
Abstract
Millimeter-wave (mm-wave) systems rely on narrow-beams to cope with the severe signal attenuation in the mm-wave frequency band. However, susceptibility to beam mis-alignment due to mobility or blockage requires the use of beam-alignment schemes, with huge cost in terms of overhead and use of system resources. In this paper, a beam-alignment scheme is proposed based on Bayesian multi-armed bandits, with the goal to maximize the alignment probability and the data-communication throughput. A Bayesian approach is proposed, by considering the state as a posterior distribution over angles of arrival (AoA) and of departure (AoD), given the history of feedback signaling and of beam pairs scanned by the base-station (BS) and the user-end (UE). A simplified sufficient statistic for optimal control is identified, in the form of preference of BS-UE beam pairs. By bounding a value function, the second-best preference policy is formulated, which strikes an optimal balance between exploration and exploitation by selecting the beam pair with the current second-best preference. Through Monte-Carlo simulation with analog beamforming, the superior performance of the second-best preference policy is demonstrated in comparison to existing schemes based on first-best preference, linear Thompson sampling, and upper confidence bounds, with up to 7%, 10% and 30% improvements in alignment probability, respectively.
Index Terms:
Millimeter-wave, beam-alignment, multi-armed bandits, Markov decision process
I Introduction
Millimeter-wave (mm-wave) technology has emerged as a promising solution to meet the demands of future communication systems supporting high capacity and mobility, thanks to abundant bandwidth availability [1]. However, high isotropic path loss and sensitivity to blockages pose challenges in the design of these systems [2]. To overcome the severe signal attenuation, mm-wave systems leverage narrow-beam communications, by using large antenna arrays at base stations (BSs) and user-ends (UEs). However, narrow beams are highly susceptible to mis-alignment due to mobility and blockage, hence they require beam-alignment schemes, which may cause huge overhead.
Therefore, the design of beam-alignment schemes with minimal overhead is of paramount importance, and has been a subject of intense research. One of the earliest yet most popular schemes is exhaustive search [3], which scans sequentially through all possible BS-UE beam pairs and selects the one with maximum signal power for data communications. To reduce the delay of exhaustive search, iterative search is proposed in [4], where scanning is first performed using wider beams, followed by refinement using narrow beams. In the aforementioned heuristic schemes, the optimal design is not considered. To address this challenge, in our previous papers [5, 6, 7, 8], we considered the optimal design of interactive beam-alignment protocols that utilize 1-bit feedback from UEs. In [5, 6], we design a throughput-optimal beam-alignment scheme for a single UE and two UEs, respectively, and we prove the optimality of a bisection search; in [7], we optimize the trade-off between data communication and beam-sweeping in a mobile scenario where the BS widens its beam to mitigate the uncertainty on the UE position; in [8], we incorporate the energy cost of beam-alignment, and prove the optimality of a fractional search method. In our aforementioned papers [5, 6, 7, 8], the optimal design is carried out under the restrictive assumption of error-free single-bit feedback. However, this assumption may not hold in the presence of significant side-lobe gain and/or low signal-to-noise ratio (SNR).
The case of erroneous or noisy feedback is considered in recent work [9, 10], and our work [11]. A coded beam-alignment scheme is proposed in [11] to correct these errors, but with no consideration of feedback to improve beam-selection. A multi-armed bandit (MAB) formulation based on upper confidence bound (UCB) is proposed in [9], by selecting the beam based on the empirical SNR distribution. A hierarchical beam-alignment scheme based on posterior matching is proposed in [10]: therein, a first-best policy is formulated, which selects the most likely beam pair based on the posterior distribution on the AoA-AoD pair. However, as we will see numerically, both UCB and first-best policies are prone to errors due to under-exploration of the beam space.
In this paper, we propose a beam-alignment design with the goal to maximize the alignment probability and the average throughput during the data communication phase. We pose the problem as a Markov decision process (MDP), where the beam pair is chosen based upon the belief over the AoA-AoD pair, given the history of scanned beams and the received signal power. We identify a simplified sufficient statistic in the form of preference of the AoA-AoD beam pairs. We derive lower and upper bounds to the value function, based on which we propose a heuristic policy which selects the beam pair with the second-best preference. We show numerically that this policy strikes a favorable trade-off between exploration and exploitation: instead of greedily choosing the beam corresponding to the most likely AoA-AoD pair (first-best [10]), it chooses the second most likely one, leading to better exploration; at the same time, it avoids wasting precious resources to scan unlikely beam pairs, leading to better exploitation than other MAB techniques, such as linear Thompson sampling (LTS) [12] and UCB [9]. The proposed second-best scheme is shown to outperform first-best [10], LTS-based [12] and UCB-based [9] schemes by up to 7%, 10% and 30% in alignment probability, respectively.
The rest of the paper is organized as follows. In Sec. II, we present the system model. In Sec. III, we formulate the problem and our proposed solution strategy. In Sec. IV, we present numerical results, followed by final remarks in Sec. V.
II System Model
We consider a downlink scenario with one BS and one UE, as depicted in Fig. 1. Time is divided into frames of duration , each with slots of duration . The frame is partitioned into two phases: a beam-alignment phase of duration ( slots), followed by a downlink data communication phase, of duration . Each beam-alignment slot is further partitioned into a pilot transmission phase, of duration , followed by a feedback phase, of duration , with . These are detailed next.
The BS and UE are equipped with uniform linear arrays (ULAs) with and antenna elements, respectively, and use analog beamforming. The signal received at the UE is
[TABLE]
where is the average transmit power of the BS; is the transmitted signal with symbols with ; is the channel matrix; is the BS beamforming vector with ; is the UE combining vector with ; is additive white Gaussian noise (AWGN), with one-sided power spectral density and system bandwidth .
Channel Model: We use the extended Saleh-Valenzuela geometric model with a single-cluster [14], as adopted in several previous works (e.g., see [15, 16, 8]). In fact, typical mm-wave channels have been shown to exhibit one dominant cluster containing most of the signal energy [17]. The single-cluster channel is modeled as
[TABLE]
where is the angle of arrival (AoA) and angle of departure (AoD) pair associated to the dominant cluster, with complex fading gain ; and are the UE and BS array response vectors, respectively, defined as
[TABLE]
where , is the antenna spacing, is the wavelength at carrier frequency , and denotes the speed of light. We assume that during the duration of one frame , remains unchanged, , and are i.i.d. Rayleigh fading in each slot with distribution , where is the path loss at distance from the BS. In fact, the AoA-AoD pair changes much more slowly than the channel gain [18].
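The array response of a ULA can be illustrated with a short sketch. The exact expression used in the paper is not recoverable from this text, so the code below assumes the common textbook form with unit-norm vectors and a sine-of-angle phase progression; the function names `ula_response` and `beamforming_gain` and the half-wavelength default spacing are illustrative assumptions, not taken from the paper.

```python
import cmath
import math


def ula_response(theta, num_antennas, spacing_over_wavelength=0.5):
    """Unit-norm ULA response vector for angle `theta` (radians).

    Assumption: phase progression 2*pi*(d/lambda)*sin(theta) between
    adjacent elements; the paper's own convention may differ.
    """
    phase_step = 2.0 * math.pi * spacing_over_wavelength * math.sin(theta)
    scale = 1.0 / math.sqrt(num_antennas)
    return [scale * cmath.exp(1j * phase_step * m) for m in range(num_antennas)]


def beamforming_gain(weights, theta, spacing_over_wavelength=0.5):
    """Return |w^H a(theta)|^2 for a beamforming vector `weights`."""
    response = ula_response(theta, len(weights), spacing_over_wavelength)
    inner = sum(w.conjugate() * a for w, a in zip(weights, response))
    return abs(inner) ** 2


if __name__ == "__main__":
    steering = ula_response(0.3, num_antennas=8)
    print(beamforming_gain(steering, 0.3))  # gain toward the steered angle
    print(beamforming_gain(steering, 1.0))  # gain away from it
```

Under these assumptions, a beam steered toward an angle has its largest gain at that angle and a much smaller gain elsewhere, which is the behavior that the sectored main-lobe and side-lobe model abstracts.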
Codebook structure: In slot , the BS uses the beamforming vector and the UE uses the combining vector , from the codebooks and , respectively. We assume a sectored model [8], in which the AoA and AoD spaces are partitioned into sectors of equal beamwidth (shown in Fig. 1 for the case of four sectors; this model approximates analog beamforming well). Accordingly, let and denote the AoA and AoD supports of the UE combiner and BS beamformer vectors and , respectively, with equal beamwidth and , where denotes the measure . We define as the joint AoA-AoD support of . We assume that the angular supports are mutually orthogonal and form a partition of the entire AoA-AoD space , i.e., and . Let be any ordering of combining and beamforming vectors, and be their support. Let be the beam index of the combining and beamforming vectors scanned in slot , so that . Let be a discrete random variable denoting the index of the support that the AoA-AoD pair of the channel belongs to, so that . Then, from (1)-(2), the received signal can be expressed as (footnote 1: the phase of is incorporated into .)
[TABLE]
where is the Kronecker delta function, equal to 1 if alignment is achieved () and to 0 otherwise (); and are, respectively, the main-lobe and side-lobe gains of the sectored model, expressed as
[TABLE]
[TABLE]
In the following, we describe the beam-alignment and data communication procedures.
Beam-Alignment: In each slot of the beam-alignment phase, the BS transmits a pilot sequence using the beam index , with transmit power . Upon receiving (based on the combining vector with index ), the UE uses a matched filter to compute the signal strength and sends the normalized received power feedback signal back to the BS, of the form
[TABLE]
where is the pre-beamforming receive SNR during beam-alignment. Then, the probability density function (pdf) of conditional on is given by
[TABLE]
where is the mean signal power in case of alignment, with
[TABLE]
The BS uses a Bayesian approach to select : starting from and given the history of feedback and scanned beam indices , the next beam index is selected. This procedure continues until the end of the beam-alignment phase.
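The feedback model can be mimicked in simulation with a few lines. The sketch below assumes that the normalized received power is exponentially distributed, which is what Rayleigh fading plus Gaussian noise at a matched-filter output implies, with one mean under alignment and another under mis-alignment. The parameter names `mean_aligned` and `mean_misaligned` are hypothetical placeholders for the mean powers defined in the text, whose exact expressions are not reproduced here.

```python
import random


def sample_feedback(scanned_beam, true_beam, mean_aligned, mean_misaligned,
                    rng=random):
    """Draw one normalized received-power feedback sample.

    Assumption: the feedback is exponential with mean `mean_aligned` when
    the scanned beam matches the true beam, and `mean_misaligned` otherwise.
    """
    mean = mean_aligned if scanned_beam == true_beam else mean_misaligned
    # random.expovariate takes the rate, i.e. the inverse of the mean.
    return rng.expovariate(1.0 / mean)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_feedback(2, 2, mean_aligned=10.0, mean_misaligned=1.0, rng=rng))
    print(sample_feedback(0, 2, mean_aligned=10.0, mean_misaligned=1.0, rng=rng))
```

Passing a seeded `random.Random` instance keeps runs reproducible, which is convenient when comparing beam-selection policies on the same feedback realizations.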
Data communication: Upon completion of the beam-alignment phase, given the history of feedback and actions , the BS selects the data communication parameters: beam index for data communication , transmission power , and data rate . These parameters are used until the end of the data communication phase.
Let be the prior belief over (or equivalently over ) available at the beginning of the beam-alignment phase. We define the expected rate during the communication phase (normalized by the frame duration), as
[TABLE]
where we have defined
[TABLE]
The probability term in (II) is the probability of achieving correct alignment, given the prior and the history of feedback and actions during the beam-alignment phase, whereas the probability term in (8) denotes the probability of non-outage with respect to the realization of the fading process (i.i.d. over time), given that correct alignment has been achieved (we assume that mis-alignment yields outage with probability one, since ).
III Problem Formulation and Solution
We now formulate the beam-alignment and data communication problem in the context of a decision process. We define a policy , part of our design, which operates as follows. At time during beam-alignment, given the history of feedback and actions , the BS selects the beam-alignment action with probability ; given , the BS selects the data communication parameters as . The goal is to design so as to maximize the expected communication rate, i.e.,
[TABLE]
where the expectation is conditional on the prior belief and on the policy being executed during beam-alignment and data communication. Note that, using (II), we can rewrite the optimization problem as
[TABLE]
i.e., the problem can be decomposed into the following two independent problems:
- find the optimal rate and power that maximize the expected rate in the communication phase, conditional on correct alignment being achieved ();
- find the optimal beam-alignment policy and the beam index for communication so as to maximize the probability of correct alignment.
The first problem can be solved efficiently by maximizing (8). In the sequel, we consider the latter problem.
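The first sub-problem, choosing the data rate given correct alignment, can be sketched numerically. Equation (8) is not recoverable from this text, so the code below uses a simplified stand-in: the transmit power is held fixed, the post-beamforming SNR is assumed exponentially distributed (Rayleigh fading) with mean `mean_snr`, and the objective is the rate times the probability of non-outage. The names `expected_rate` and `best_rate`, and the grid search itself, are illustrative assumptions rather than the paper's method.

```python
import math


def expected_rate(rate_bps_hz, mean_snr):
    """Rate times non-outage probability under exponential SNR.

    Non-outage means log2(1 + snr) >= rate, i.e. snr >= 2**rate - 1,
    which has probability exp(-(2**rate - 1) / mean_snr).
    """
    threshold = 2.0 ** rate_bps_hz - 1.0
    return rate_bps_hz * math.exp(-threshold / mean_snr)


def best_rate(mean_snr, max_rate=20.0, steps=2000):
    """Grid-search the rate that maximizes the expected rate."""
    candidates = [max_rate * i / steps for i in range(1, steps + 1)]
    return max(candidates, key=lambda r: expected_rate(r, mean_snr))


if __name__ == "__main__":
    rate = best_rate(mean_snr=100.0)
    print(rate, expected_rate(rate, 100.0))
```

A very low rate wastes the channel while a very high rate is almost always in outage, so the maximizer lies strictly in between; this is the trade-off that the rate and power optimization resolves.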
Let be the belief over given the history of actions and feedback and prior belief . It serves as a sufficient statistic for optimal control for problem P1. In the following lemma, we present an equivalent simplified sufficient statistic along with its dynamics.
Lemma 1.
Let denote the prior preference of . Given the action and feedback pair , the belief at is updated as
[TABLE]
where
[TABLE]
and we have defined
[TABLE]
Proof.
Given the belief and , we have
[TABLE]
where (a) follows from the definition of belief; (b) follows from Bayes’ rule and denotes proportionality up to a normalization factor independent of ; (c) follows from the facts that is independent of history given , and is independent of action given , and by the definition of belief ; (d-e) follow by substitution of the pdf of given in (5) and by definition of . We prove the lemma using induction. The lemma holds for by definition of . Let and be given by (9), then using (III)(e) normalized to sum to one, we get
[TABLE]
where is given by (10). ∎
Let . Then, the previous lemma demonstrates that is a sufficient statistic for control decisions, since it is sufficient for computing the belief at time . Therefore, can be expressed as , which maps the current preference vector to beam index . This result makes it possible to achieve an efficient implementation, since the belief can be updated according to simple preference update rules as in (10), rather than via complex Bayesian belief updates. In the subsequent analysis, we will use rather than as the state.
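The Bayesian update underlying Lemma 1 can be sketched directly on the belief vector. The code below does not reproduce the paper's preference recursion (10), which is not recoverable from this text; instead it applies Bayes' rule with exponential feedback likelihoods, one mean for the scanned beam being aligned and another otherwise. The names `exp_pdf`, `update_belief`, `mean_aligned`, and `mean_misaligned` are assumptions made for illustration.

```python
import math


def exp_pdf(y, mean):
    """Density of an exponential random variable with the given mean."""
    return math.exp(-y / mean) / mean


def update_belief(belief, scanned_beam, feedback, mean_aligned, mean_misaligned):
    """Posterior over the true beam index after one scan (Bayes' rule).

    Hypothesis "true beam == scanned_beam" explains the feedback with the
    aligned mean; every other hypothesis uses the mis-aligned mean.
    """
    unnormalized = []
    for beam, prob in enumerate(belief):
        mean = mean_aligned if beam == scanned_beam else mean_misaligned
        unnormalized.append(prob * exp_pdf(feedback, mean))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    print(update_belief(prior, scanned_beam=1, feedback=10.0,
                        mean_aligned=10.0, mean_misaligned=1.0))
```

Because all non-scanned hypotheses share the same likelihood, only the ratio between the two densities matters, which is consistent with the text's point that a simple per-beam update suffices. For long horizons, a log-domain implementation would be less prone to numerical underflow.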
III-A MDP Formulation
Thanks to the identification of the sufficient statistic , we model the optimization problem P1 as a Markov decision process (MDP) and optimize the decision variables to maximize the alignment probability in the data-communication phase. The MDP is a 5-tuple , with elements described as follows.
Time Horizon: given as where denote the slot indices associated with the beam-alignment phase, whereas at , the communication parameters are selected and used until the end of the frame.
State space: given as , i.e., all possible values of preference vectors .
Action space: the set containing all the beam indices, .
State transition distribution: Given state and action used in the th stage of the beam-alignment phase, the feedback is generated with pdf
[TABLE]
leading to the new state
[TABLE]
where is the vector with entries .
Reward function: the reward is the probability of choosing a beam index such that in the data communication phase, so that correct alignment is achieved, yielding
[TABLE]
We now formulate the value function iteration for the MDP.
III-B Value Function
The value function under the optimal policy is given as
[TABLE]
where is the Q-function under the state-action pair , defined recursively as
[TABLE]
and for , using (14),
[TABLE]
This yields the optimal value function in the data communication phase, by choosing the beam index with maximum preference ,
[TABLE]
In the beam-alignment phase (), combining (17) and (III-B), we obtain iteratively the value function as
[TABLE]
In the following theorem, whose proof is provided in the Appendix, we unveil structural properties of . We find a lower-bound and an upper-bound to the Q-function and show that these bounds are optimized by a policy which, in each stage of the beam-alignment phase, selects the beam index with the second-best preference. This result will be the basis for our proposed policy evaluated numerically in Sec. IV.
Theorem 1.
For , the Q-function is bounded as
[TABLE]
where we have defined
[TABLE]
where
[TABLE]
Let be an ordering of beam indices in decreasing order of preference, i.e., , then the optimal value function is bounded as
[TABLE]
with the maximizer of and given by the second-best beam index .
Proof.
The proof is provided in the Appendix. ∎
As a result of this Theorem, both the upper and lower bounds of the Q-function are maximized by the second-best beam index policy, which selects the beam index with the second-best preference during the beam-alignment phase. This policy will be evaluated numerically in the next section, against other MAB-based schemes proposed in the literature.
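The second-best rule itself is simple to state in code. The sketch below ranks beam indices by preference and returns the runner-up for scanning, while the top-ranked index is the one used for data communication at the end of beam-alignment. The function names and the tie-breaking rule (lowest index first, a consequence of Python's stable sort) are assumptions; the paper does not specify how ties are resolved.

```python
def best_beam(preferences):
    """Index with the largest preference (used for data communication)."""
    return max(range(len(preferences)), key=lambda i: preferences[i])


def second_best_beam(preferences):
    """Index with the second-largest preference (scanned during alignment).

    Ties are broken in favor of the lower index because sorted() is stable.
    """
    if len(preferences) < 2:
        raise ValueError("need at least two beams to pick a second-best one")
    ranked = sorted(range(len(preferences)),
                    key=lambda i: preferences[i], reverse=True)
    return ranked[1]


if __name__ == "__main__":
    prefs = [0.1, 0.5, 0.3, 0.1]
    print(best_beam(prefs), second_best_beam(prefs))
```

A full sort is used for clarity; for large codebooks, a single pass that tracks the two largest entries would avoid the sorting cost.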
IV Numerical Results
In this section, we evaluate the performance of the second-best beam index selection scheme () with analog beamforming, and compare it with three other schemes. The first one is based on LTS, a popular MAB scheme [12]. In LTS, in each slot the action is chosen according to the belief distribution, i.e., . The second scheme is based on scanning the most likely beam index () as proposed in [10] (first-best). The third scheme is based on UCB as proposed in [9]. We evaluate the performance of all four schemes in terms of the probability of alignment and spectral efficiency using Monte-Carlo simulation with iterations for each simulated point, with parameters as follows: , , , , , , , m, . The BS uses antennas and partitions the AoD space into sectors, each with a beamwidth of and with uniform prior ; the UE is isotropic, hence it uses antenna with a single sector. We use the beamforming design proposed in [13] for ULAs with antenna spacing . With this configuration, the main-lobe and side-lobe gains are best approximated by , .
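The LTS baseline, as described here, draws the beam to scan at random according to the current belief. A minimal sketch of that sampling step follows; the function name `sample_beam_from_belief` is an assumption, and the linear-model details of LTS in [12] are not reproduced.

```python
import random


def sample_beam_from_belief(belief, rng=random):
    """Draw a beam index with probability equal to its belief."""
    return rng.choices(range(len(belief)), weights=belief, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(1)
    belief = [0.7, 0.2, 0.1]
    draws = [sample_beam_from_belief(belief, rng) for _ in range(10)]
    print(draws)
```

Because every beam with nonzero belief can be drawn, this rule occasionally scans beams that are unlikely to be aligned.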
In Fig. 2, we depict the probability of alignment achieved by the aforementioned schemes versus the pre-beamforming SNR . It can be observed that second-best has better performance than the other three schemes, with up to 7%, 10%, and 30% performance gains compared to first-best, LTS-based and UCB-based schemes, respectively. The performance gain of second-best is attributed to a better exploration-exploitation trade-off. The first-best scheme suffers from poor exploration since it "greedily" chooses the beam index most likely to succeed, but fails to test other beams that may be under-explored, and is thus prone to alignment errors. On the other hand, the LTS-based scheme suffers from poor exploitation since it may scan the least likely beams. The proposed second-best scheme, in contrast, strikes a favorable trade-off between exploration and exploitation: instead of greedily choosing the most likely beam, it chooses the second most likely one, leading to better exploration than first-best; simultaneously, by not choosing beam pairs that are unlikely to succeed, it leads to better exploitation compared to the LTS-based and UCB-based schemes. Finally, compared to UCB, second-best is better tailored to the structure of the model, since it aims to maximize the alignment probability at the end of the beam-alignment phase (see (16)), rather than the surrogate metric of UCB, namely the cumulative SNR accrued during beam-alignment.
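A toy Monte-Carlo experiment in the spirit of this comparison can be sketched as follows. It is not the paper's simulation setup: the feedback is assumed exponential with hypothetical means `mean_aligned` and `mean_misaligned`, the prior is uniform, the UCB baseline is omitted, and "belief_sampling" stands in for the LTS-style rule. All names and default parameter values are illustrative assumptions.

```python
import math
import random


def choose_beam(policy, belief, rng):
    """Pick the beam to scan under one of three illustrative policies."""
    ranked = sorted(range(len(belief)), key=lambda i: belief[i], reverse=True)
    if policy == "first_best":
        return ranked[0]
    if policy == "second_best":
        return ranked[1]
    if policy == "belief_sampling":
        return rng.choices(range(len(belief)), weights=belief, k=1)[0]
    raise ValueError(f"unknown policy: {policy}")


def run_frame(policy, num_beams, num_slots, mean_aligned, mean_misaligned, rng):
    """Simulate one beam-alignment phase; return True on correct alignment."""
    true_beam = rng.randrange(num_beams)
    belief = [1.0 / num_beams] * num_beams
    for _ in range(num_slots):
        beam = choose_beam(policy, belief, rng)
        mean = mean_aligned if beam == true_beam else mean_misaligned
        feedback = rng.expovariate(1.0 / mean)
        # Bayes' rule with exponential likelihoods (assumed feedback model).
        weights = []
        for b, prob in enumerate(belief):
            m = mean_aligned if b == beam else mean_misaligned
            weights.append(prob * math.exp(-feedback / m) / m)
        total = sum(weights)
        belief = [w / total for w in weights]
    chosen = max(range(num_beams), key=lambda i: belief[i])
    return chosen == true_beam


def estimate_alignment_probability(policy, trials=2000, num_beams=8,
                                   num_slots=16, mean_aligned=20.0,
                                   mean_misaligned=1.0, seed=0):
    """Fraction of simulated frames that end with the correct beam."""
    rng = random.Random(seed)
    hits = sum(run_frame(policy, num_beams, num_slots,
                         mean_aligned, mean_misaligned, rng)
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for name in ("first_best", "second_best", "belief_sampling"):
        print(name, estimate_alignment_probability(name))
```

The printed estimates depend entirely on the toy parameters and should not be read as reproducing Fig. 2; the sketch only shows how the policies plug into a common belief-update loop.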
In Fig. 3, we depict the spectral efficiency against the fraction of used for BA . We fix the SNR for beam-alignment as and the data-communication power as . Similar to Fig. 2, second-best outperforms the three other schemes, owing to improved alignment. The spectral efficiency is maximized at a unique maximizer : it increases initially with as the beam-alignment probability improves with . However, as increases beyond , this gain is offset by the increased overhead and reduced duration of the data communication phase.
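The overhead trade-off described here can be illustrated with a toy calculation. The alignment-probability curve below is purely hypothetical (a saturating exponential starting from a uniform guess), chosen only so that alignment improves with more beam-alignment slots; it is not fitted to the paper's results. The names `toy_alignment_probability`, `spectral_efficiency`, and `best_ba_slots`, and all default values, are assumptions.

```python
import math


def toy_alignment_probability(num_ba_slots, num_beams=16, learning_rate=0.15):
    """Hypothetical alignment probability that saturates with more slots."""
    floor = 1.0 / num_beams  # blind guess with no beam-alignment slots
    return 1.0 - (1.0 - floor) * math.exp(-learning_rate * num_ba_slots)


def spectral_efficiency(num_ba_slots, slots_per_frame, rate_bps_hz=5.0):
    """Rate scaled by the data-phase fraction and the alignment probability."""
    data_fraction = 1.0 - num_ba_slots / slots_per_frame
    return data_fraction * toy_alignment_probability(num_ba_slots) * rate_bps_hz


def best_ba_slots(slots_per_frame):
    """Number of beam-alignment slots maximizing the toy spectral efficiency."""
    return max(range(slots_per_frame + 1),
               key=lambda n: spectral_efficiency(n, slots_per_frame))


if __name__ == "__main__":
    n_star = best_ba_slots(200)
    print(n_star, spectral_efficiency(n_star, 200))
```

With any alignment curve that rises and then flattens, the product of the shrinking data fraction and the growing alignment probability peaks at an interior point, which mirrors the behavior reported for Fig. 3.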
V Conclusions
In this paper, we have formulated the beam-alignment problem as a Bayesian MAB problem. For the optimal control design, we have identified a simplified sufficient statistic referred to as the preference of beam pairs. Based on the preference and bounding of the value function, we have proposed a heuristic policy, which selects the beam pair with the second-best preference to scan. We have shown numerically that the proposed scheme outperforms the first-best, LTS-based, and UCB-based beam-alignment schemes proposed in the literature.
Appendix: Proof of Theorem 1
Proof.
We prove the theorem using induction. Notice that from the definition of Q-function (III-B) and the optimal value function expression (19) for , we get
[TABLE]
where we have defined the preference update (15) as
[TABLE]
Moreover, using (14) and (11) we note that
[TABLE]
This yields
[TABLE]
where (b) follows by evaluating the integral in (a) for the two cases in (22), and noting that it is given by . Using Lemma 2 and (Proof.)(b), the optimal value function becomes
[TABLE]
Thus, the theorem statement holds for with equality. Assume it holds for . Using Lemma 2, we can bound
[TABLE]
[TABLE]
Using (III-B), the induction hypothesis (25) for and the above bound, we obtain
[TABLE]
Moreover, we note that
[TABLE]
Substituting (Proof.) and (27) into (Proof.) yields
[TABLE]
The first integral in (Proof.) is equal to and the second integral is found to be equal to
[TABLE]
Substituting these integrals into (Proof.) yields the following lower bound on the Q-function,
[TABLE]
which proves the induction step (20), and whose maximization (see Lemma 2) yields (25).
Similarly, using the induction hypothesis (26) for and the upper-bound
[TABLE]
we obtain the following upper-bound to the Q-function,
[TABLE]
The integral above is equal to , which proves the induction step (21), hence
[TABLE]
Noting that (see Lemma 2), substitution in (32) yields (26). ∎
Lemma 2.
We have that and
[TABLE]
Proof.
To show that , we proceed as follows. Clearly, if , then , hence
[TABLE]
maximized at . Therefore, we restrict without loss in performance. Next, we show that . Let . If , then and . Otherwise,
[TABLE]
Note that is decreasing in , minimized at , yielding, after algebraic steps,
[TABLE]
In both cases, . Substitution of in (22) yields (33). ∎
References
- [1] M. R. Akdeniz, Y. Liu, M. K. Samimi, S. Sun, S. Rangan, T. S. Rappaport, and E. Erkip, “Millimeter Wave Channel Modeling and Cellular Capacity Evaluation,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1164–1179, June 2014.
- [2] T. S. Rappaport, R. W. Heath, R. C. Daniels, and J. N. Murdock, Millimeter wave wireless communications. Prentice Hall, 2015.
- [3] C. Jeong, J. Park, and H. Yu, “Random access in millimeter-wave beamforming cellular networks: issues and approaches,” IEEE Communications Magazine, vol. 53, no. 1, pp. 180–185, January 2015.
- [4] V. Desai, L. Krzymien, P. Sartori, W. Xiao, A. Soong, and A. Alkhateeb, “Initial beamforming for mm Wave communications,” in 48th Asilomar Conference on Signals, Systems and Computers, Nov 2014.
- [5] M. Hussain and N. Michelusi, “Throughput optimal beam alignment in millimeter wave networks,” in 2017 Information Theory and Applications Workshop (ITA), Feb 2017, pp. 1–6.
- [6] R. A. Hassan and N. Michelusi, “Multi-user beam-alignment for millimeter-wave networks,” in 2018 Information Theory and Applications Workshop (ITA), Feb 2018, pp. 1–6.
- [7] N. Michelusi and M. Hussain, “Optimal beam-sweeping and communication in mobile millimeter-wave networks,” in 2018 IEEE International Conference on Communications (ICC), May 2018, pp. 1–6.
- [8] M. Hussain and N. Michelusi, “Energy-efficient interactive beam alignment for millimeter-wave networks,” IEEE Transactions on Wireless Communications, vol. 18, no. 2, pp. 838–851, Feb 2019.
