Upper-Confidence Bound for Channel Selection in LPWA Networks with Retransmissions
Rémi Bonnefoi (IETR), Lilian Besson (IETR), Julio Manco-Vasquez (IETR), Christophe Moy (IETR)

TL;DR
This paper investigates the use of UCB-based Multi-Arm Bandit algorithms for channel selection in LPWA IoT networks, demonstrating improved transmission success rates by leveraging retransmission data.
Contribution
It introduces and evaluates UCB-based heuristics for IoT channel access, highlighting their effectiveness and simplicity compared to more complex strategies.
Findings
UCB algorithms significantly improve successful transmission probabilities.
Pure UCB channel access performs as well as more complex methods.
Retransmission data enhances the contextual information for learning.
Abstract
In this paper, we propose and evaluate different learning strategies based on Multi-Arm Bandit (MAB) algorithms. They allow Internet of Things (IoT) devices to improve their access to the network and their autonomy, while taking into account the impact of encountered radio collisions. To that end, several heuristics employing Upper-Confidence Bound (UCB) algorithms are examined, to explore the contextual information provided by the number of retransmissions. Our results show that approaches based on UCB obtain a significant improvement in terms of successful transmission probabilities. Furthermore, they also reveal that a pure UCB channel access is as efficient as more sophisticated learning strategies.
This publication is supported by the French National Research Agency (ANR), under the projects SOGREEN and EPHYL (grants No. ANR-14-CE28-0025-02 and No. ANR-16-CE25-0002-03), by Région Bretagne, France, by École Normale Supérieure de Paris-Saclay, by the European Union, through the European Regional Development Fund (ERDF), and by the Ministry of Higher Education and Research, Brittany and Rennes Métropole, through the CPER Project SOPHIE / STIC & Ondes.
Rémi Bonnefoi1, Lilian Besson1, Julio Manco-Vasquez1, and Christophe Moy2
1 IETR / CentraleSupélec Campus de Rennes, F- Cesson-Sévigné, France,
Remi.Bonnefoi,Lilian.Besson,JulioCesar.MancoVasquez@CentraleSupelec.fr
2 Univ Rennes, CNRS, IETR - UMR , F-, Rennes, France
Index Terms:
Low Power Wide Area, Multi-Armed Bandits, Upper-Confidence Bound, retransmissions, Internet of Things.
I Introduction
Nowadays, the Internet of Things (IoT), and in particular the Low Power Wide Area (LPWA) technology, is considered a main driver for a vast variety of applications that will support communications among a large number of devices. In fact, network operators are starting to deploy Machine to Machine (M2M) solutions using LPWA networking technologies [1]. For instance, LoRaWAN and SigFox technologies have been widely adopted for the monitoring of large-scale systems (e.g., smart cities, metering), where a large number of devices compete for the transmission of their packets in the unlicensed Industrial, Scientific and Medical (ISM) bands.
Nevertheless, the need to accommodate a growing number of energy-limited end-devices requires the development of contention-based protocols better tailored to LPWAN technologies. Thus, novel access mechanisms based on collision-avoidance methods are needed to avoid degrading the network performance in these unlicensed bands. In fact, the number of packet collisions increases as more devices share the same band without coordination. Hence, an important concern in the Medium Access Control (MAC) design is to reduce the Packet Loss Ratio (PLR) due to the interference caused by collisions among the devices.
In this regard, in the context of Cognitive Radio [2, 3], Multi-Arm Bandit (MAB) algorithms [4, 5, 6] have recently been proposed as a potential solution for channel access in LPWA networks [7, 8, 9]. For instance, in [9], the impact of non-stationarity on the network performance using MAB algorithms is studied. In that work, low-cost algorithms following two well-known approaches, namely the Upper-Confidence Bound (UCB) [4, 5] and the Thompson Sampling (TS) [10] algorithms, have reported encouraging results. Other recent directions include theoretical analysis [11, 12] and realistic empirical simulations [13, 14] of the application of MAB algorithms to slotted wireless protocols in a decentralized manner, or applications to multi-hop networks [15, 16]. None of the above-mentioned articles discusses in detail the impact of retransmissions on the performance of MAB learning algorithms, as we do in this paper.
The aim of this paper is to assess the performance of MAB algorithms [6] for channel selection in LPWA networks, while taking into account the impact of retransmissions on the network performance. For this reason, several decision-making strategies are applied from the first retransmission onward (i.e., once a collision has occurred). The proposed approach employs the contextual information provided by the number of retransmissions, and is implemented at each device, so that no coordination among them is needed. Moreover, our UCB-based heuristics have a low complexity, making them suitable for being embedded in LPWA devices.
The contributions of this paper are summarized as follows:
- Firstly, we provide a closed-form approximation of the radio collision probability after a first retransmission. By doing this, we highlight the need to develop a learning approach for channel selection upon collision.
- Secondly, different heuristics are proposed to cope with retransmissions.
- Lastly, we conduct simulations in order to compare the performance of the proposed heuristics with a naive uniform random approach, and a UCB strategy (i.e., without any learning for the retransmissions).
The rest of the paper is organized as follows. First, the system model is introduced in Section II. Our motivations are presented in Section III, and a formal description of the MAB learning algorithms is given in Section IV. The proposed UCB-based heuristics are presented in Section V, while the corresponding numerical results are shown in Section VI. Finally, some conclusions are drawn in Section VII.
II System model
II-A LPWA Network
We consider in this paper an LPWA network composed of a gateway and a large number of end-devices that regularly send short data packets, where channels () are available for the transmission of their packets.
We assume that this network is constituted by two types of devices: on one hand, we have static devices that operate in one channel in order to communicate with the gateway (note that, for unlicensed bands, this definition also encompasses any device following a different standard or trying to establish communication with gateways of other networks). On the other hand, there are IoT devices, which have the additional advantage of being able to select any of the available channels to perform their transmissions.
Regardless of the type of device, each of them follows a slotted ALOHA protocol [17], and transmits a packet in a given time slot with a fixed probability. We make the hypothesis that the transmission is successful if the channel is available; otherwise, upon radio collision, the device will attempt to retransmit its packet up to a fixed maximum number of times. Note that every retransmission is carried out after a random back-off time, uniformly distributed over a back-off interval of fixed length.
II-B Model of our IoT devices
The aforementioned contention process can be described by a Markov chain model [18] similar to the one presented in [19], as it is depicted in Fig. 1. A device containing a packet for transmission goes from an idle state to a transmission state, while considering retransmissions due to different collision probabilities, i.e., , at each back-off stage. At each time slot, a transition from an idle state to a transmission state (denoted as Trans.) occurs if a packet transmission is required, while waiting states (denoted as Wait), correspond to a back-off interval.
A device aims to select the channel with the highest probability of successful transmission, for which it resorts to a reinforcement learning approach. The problem is formulated as a MAB problem, where each channel (also called an arm) is viewed as a gambling machine (bandit), and each bandit yields a reward. At every trial, a device chooses a channel with the aim of maximizing the sum of the collected rewards. These rewards are the acknowledgment (Ack) signals received after transmitting packets to the gateway. In this way, a transmission is considered successful when an acknowledgment is received, and a learning approach is employed to select the best channel.
We address the problem of channel selection taking into account the described Markov model for the retransmissions of end-devices. It motivates our present work for which we consider the retransmissions in the analysis of MAB algorithms.
III Motivations for the proposed approach
When a device experiences a collision, it enters a back-off state before retransmitting the same packet on a channel. If all devices remain in the same channel for retransmissions, this could result in a sequence of successive collisions between the same packets that previously collided. Thus, it seems interesting to include in the decision-making policy the possibility for a device to retransmit in a different channel. One of our motivations to develop new MAB algorithms for our problem is this option of using a different communication channel between the first transmission and the next retransmissions.
By considering this possibility, the device will have more to learn; thus, we expect the learning time to be longer, but the final performance gain (i.e., in terms of successful transmission rate) may increase too. Section VI presents an analysis to check this performance gain, for various heuristics based on the UCB algorithm.
Hereafter, we start by presenting a mathematical derivation that backs up this idea. To do so, we study the collision probabilities considering the Markov process depicted in Fig. 1, in order to foresee the impact of bandit strategies and to set guidelines for the design of heuristic approaches.
III-A Probability of collision at the second transmission slot
As is well known, a collision during an access time can be overcome by a retransmission procedure (this can take several retransmission attempts). What interests us here is to obtain a mathematical approximation of the collision probability at the second transmission slot, as a function of the collision probability at the first transmission.
We consider two hypotheses and defined as,
- •
: The probability , is composed by the sum of two probabilities: i) the probability of colliding consecutively twice, i.e., the devices that collide at a given time slot and collide again when retransmitting their packets, and ii) the probability of collision among devices that did not collide in the same previous collision. Moreover, we suppose that the number of devices involved in a collision is small in comparison to the total number of devices.
- •
: The total number of the back-off stages at time is constant, and it is assumed to be large enough to consider that no device will ever be in the last failure state (this case is the one on the right side in Figure 1), after successive failed retransmissions.
Considering one device and a channel, we denote the probability that it is transmitting a packet for the time in a given time slot (with ), and let be the probability that it transmits a packet. We consider active devices following the same policy.
We assume to be in the steady state [18], in our Markov chain model depicted in Figure 1, and thus the probabilities no longer depend on the slot number (i.e., ). Therefore, the probability that this device has a collision at the first transmission is , and has the following expression
[TABLE]
Moreover, from (1) we define the probability that involves the collision of packets sent by each IoT device (for any ), during the first transmission slot, and is defined by the following equation
[TABLE]
As explained above, if an IoT device experiences a collision at the first transmission, it proceeds for the retransmission of its packet after a random back-off interval. We denote the probability to have a collision with a packet involved in the previous collision. Under the assumption, the number of packets involved in the same previous collision remains very small in comparison to the total number of devices that may transmit during this time. In other words, this collision probability does not depend on previous retransmissions and is equal to . So, the probability that the same device’s packet experiences again a collision at the second time slot is
[TABLE]
If the device has a collision at the first attempt, we consider the probability that it has a collision with exactly packets (for any ), and that at least one of the devices involved in this first collision chooses the same back-off interval,
[TABLE]
Besides, is the conditional probability of collision with a packet sent by a device involved in the previous collision given that the packet experienced collision at its first transmission. Hence, under hypothesis , we can use Bayes theorem and the law of total probability to relate with , and the different probabilities that a device experienced a collision during the first slot and has the same back-off interval for its retransmission is,
[TABLE]
Therefore, the expression of is
[TABLE]
Once again under , assuming that the number of devices involved in the first collision is small compared to , the first terms of the sum in (III-A) are predominant. We derive,
[TABLE]
Moreover, for these terms, is small compared to , and so can be approximated to . Thus it gives,
[TABLE]
Assuming amounts to consider that . As a consequence, the sum in equation (7) can be supplemented by negligible terms,
[TABLE]
We use the binomial theorem to compute the sum in (8), and we rewrite the expression of as
[TABLE]
Finally, our approximation of can be obtained by inserting (9) in (2).
III-B Behaviour analysis of the two collision probabilities
In order to assess the proposed approximation, we suppose a unique channel where all the devices follow the same contention Markov process. We simulate an ALOHA protocol with a maximum number of retransmissions , a maximum back-off interval , and a transmission probability . In Fig. 2, we show the collision probabilities for different number of devices (from up-to ), for both and .
From these simulations, we can verify that our approximation is very precise for lower values (i.e., the red and orange curves are quite close). Moreover, a significant gap between the two collision probabilities can be observed, which suggests resorting to MAB algorithms for the channel selection for both the first transmission and the next retransmissions.
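The kind of single-channel experiment described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' MATLAB/Octave code: the function name `simulate_aloha`, all default parameter values, and the convention that the back-off is drawn uniformly between 1 and `backoff` slots are assumptions made for this example.

```python
import random


def simulate_aloha(n_devices=50, p=0.02, max_retx=5, backoff=10,
                   n_slots=20000, seed=0):
    """Single-channel slotted ALOHA with retransmissions (hypothetical values).

    Returns two lists indexed by attempt number (0 = first transmission):
    how many attempts were made, and how many of them collided.
    """
    rng = random.Random(seed)
    attempt = [None] * n_devices      # None = idle, else index of pending attempt
    wait = [0] * n_devices            # remaining back-off slots
    tried = [0] * (max_retx + 1)
    collided = [0] * (max_retx + 1)

    for _ in range(n_slots):
        senders = []
        for d in range(n_devices):
            if attempt[d] is None:
                # An idle device starts a new packet with probability p.
                if rng.random() < p:
                    attempt[d] = 0
                    senders.append(d)
            else:
                # A device in back-off transmits when its timer expires.
                wait[d] -= 1
                if wait[d] == 0:
                    senders.append(d)

        collision = len(senders) > 1
        for d in senders:
            a = attempt[d]
            tried[a] += 1
            if not collision:
                attempt[d] = None                 # success: back to idle
            else:
                collided[a] += 1
                if a < max_retx:
                    attempt[d] = a + 1            # schedule a retransmission
                    wait[d] = rng.randint(1, backoff)
                else:
                    attempt[d] = None             # packet dropped
    return tried, collided
```

Comparing `collided[0] / tried[0]` with `collided[1] / tried[1]` gives empirical estimates of the collision rates at the first transmission and at the first retransmission, which is the comparison discussed in this section.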
III-C Learning is useful for non-congested networks
It is worth to highlight that, if we write (2) as , then it is obvious that is always larger than (as ). But for large values of , so the gap gets small, and for small values of the gap is significant. Moreover, we can verify (e.g., numerically or by differentiating) that the gap decreases when increases (for fixed and ). This backups mathematically the observation we made from Fig. 2: the smaller , the larger is the gap between and .
We interpret this fact in two different situations. On one hand, in a congested network, when devices suffer from a large probability of collision on their first transmission, the collision probability at the second transmission slot is close to it, and so devices cannot really hope to reduce their collision probabilities even if they use a different channel for retransmission. On the other hand, if the first collision probability is small enough, i.e., in a network not yet too congested, then our derivation shows that the collision probability at the second transmission slot is noticeably larger, meaning that the possible gain of retransmitting in a different channel than the one used for the first transmission can be large, in terms of collision probability. In other words, when learning can be useful (small first collision probability), learning to retransmit in a different channel can have a large impact on the global collision rate, thus justifying our approach.
IV A well-known MAB Algorithm: UCB
Without loss of generality, we have adopted a well-studied stochastic MAB learning algorithm, where the reward distributions are unknown and assumed to be independent and identically distributed (i.i.d). The arms model the channels denoted as , and the players, the dynamic devices, learn the distributions to be able to progressively focus on the best arm, i.e., the arm with largest mean representing the mean availability of a given channel .
Before presenting our proposed heuristics, we describe the UCB bandit algorithm [4]. It has been reported to be efficient, while featuring a low implementation complexity. For this reason, it has been employed for IoT applications [9], and we build on this approach to develop our proposals.
IV-A The UCB algorithm
A first approach is to only use an empirical mean estimator of the rewards in every channel, and select the channel with the highest estimated mean at every time step; but this greedy approach is known to fail dramatically [5]. Indeed, with this policy, the selection of arms depends too much on the first draws: if the first transmission in one channel fails and the first one on the other channels succeeds, the device will never use the first channel again, even if it is the best one (i.e., the most available, on average).
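The failure mode of the greedy policy can be reproduced with a small deterministic example. The sketch below is purely illustrative: the function `run_greedy` and the two scripted reward sequences are hypothetical and chosen so that the best channel fails exactly once, on its very first use.

```python
def run_greedy(reward_fns, n_steps):
    """Greedy channel selection on the empirical mean only (no exploration bonus).

    reward_fns[k](n) returns the reward of channel k on its n-th use (n starts at 0).
    Returns how many times each channel was selected.
    """
    n_arms = len(reward_fns)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(n_steps):
        if t < n_arms:
            k = t  # try every channel once
        else:
            k = max(range(n_arms), key=lambda a: sums[a] / counts[a])
        sums[k] += reward_fns[k](counts[k])
        counts[k] += 1
    return counts


# Hypothetical channels: the best one fails once and then always succeeds,
# while the other one succeeds only every other time.
best = lambda n: 0.0 if n == 0 else 1.0
mediocre = lambda n: 1.0 if n % 2 == 0 else 0.0

counts = run_greedy([best, mediocre], 100)
print(counts)  # [1, 99]
```

After its single unlucky draw, the best channel keeps an empirical mean of zero and is never selected again, which is the behaviour that the confidence bonus of UCB is designed to prevent.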
Rather than relying on the empirical mean reward alone, UCB algorithms instead use a confidence interval on the unknown mean of each arm, which can be viewed as adding an exploration “bonus” to the empirical mean. They follow the “optimism-in-face-of-uncertainty” principle: at each step, they play according to the best model, as the statistically best possible arm (i.e., the one with the highest upper confidence bound) is selected.
More formally, for one device, let $N_k(t)$ be the number of times the channel $k$ (for $k \in \{1, \dots, K\}$, with $K$ the number of channels) was selected up to time $t$, for any $t \geq 1$,

$$N_k(t) = \sum_{\tau=1}^{t} \mathbb{1}\big(C(\tau) = k\big),$$

where $\mathbb{1}(\cdot)$ is an indicator function that is equal to $1$ if the IoT device chooses, for its $\tau$-th transmission, the channel $k$ (i.e., $C(\tau) = k$), and $0$ otherwise. The empirical mean estimator of channel $k$ is defined as the mean reward obtained up to time $t$,

$$\hat{\mu}_k(t) = \frac{1}{N_k(t)} \sum_{\tau=1}^{t} r_k(\tau)\, \mathbb{1}\big(C(\tau) = k\big),$$

where $r_k(\tau)$ is the reward obtained after transmission in channel $k$ at time $\tau$ ($1$ for a successful transmission, and $0$ otherwise). A confidence term $B_k(t)$ is given by [5],

$$B_k(t) = \sqrt{\frac{\alpha \log(t)}{N_k(t)}},$$

where $\alpha$ refers to an exploration coefficient (the larger this coefficient is, the longer the exploration; the algorithm is proven to be order optimal for a range of values of $\alpha$ given in [6], and has been reported to perform well for lower values of $\alpha$), which we set as suggested in [20] and as done in previous works [7, 9]. Then, an upper confidence bound in each channel $k$ is defined as

$$U_k(t) = \hat{\mu}_k(t) + B_k(t).$$

Finally, the transmission channel at the next time step $t+1$ is the one maximizing this index $U_k(t)$, as it is the one expected to be the best given the observations up to the current time step $t$,

$$C(t+1) = \arg\max_{1 \leq k \leq K} U_k(t).$$
The UCB algorithm is implemented independently by each device, and we present it in Algorithm 1. Note that a device using this first approach selects a single channel, which is used for the first transmission and all the corresponding retransmissions of a packet.
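As a rough illustration of the per-device procedure described in this section, a UCB channel selector could be written as follows. This is a minimal sketch under stated assumptions and not the authors' implementation: the class name `UCB`, its method names, and the default exploration coefficient `alpha=0.5` are choices made for this example, and ties are broken in favour of the lowest channel index.

```python
import math


class UCB:
    """Minimal UCB index policy for one device (illustrative sketch)."""

    def __init__(self, n_channels, alpha=0.5):
        self.alpha = alpha                  # exploration coefficient (assumed value)
        self.counts = [0] * n_channels      # number of selections of each channel
        self.sums = [0.0] * n_channels      # cumulated rewards of each channel
        self.t = 0                          # number of transmissions so far

    def index(self, k):
        """Upper confidence bound of channel k (infinite if never tried)."""
        if self.counts[k] == 0:
            return float("inf")
        mean = self.sums[k] / self.counts[k]
        bonus = math.sqrt(self.alpha * math.log(self.t) / self.counts[k])
        return mean + bonus

    def select(self):
        """Channel with the largest index (lowest index wins ties)."""
        return max(range(len(self.counts)), key=self.index)

    def update(self, k, reward):
        """Record the reward (1 if an Ack was received, 0 otherwise)."""
        self.t += 1
        self.counts[k] += 1
        self.sums[k] += reward


# Toy usage: only channel 2 ever returns an acknowledgment.
ucb = UCB(n_channels=3)
for _ in range(200):
    k = ucb.select()
    ucb.update(k, 1.0 if k == 2 else 0.0)
```

The memory footprint is two counters per channel plus a time index, which is consistent with the low complexity claimed for embedded devices.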
V Proposed Heuristics
A device that implements the UCB algorithm is led to focus its transmissions and retransmissions on the channel which has been identified as the best. As explained in Section III, focusing on one channel increases the collision probability for retransmissions. In this Section, we describe the proposed heuristics for the channel selection in a retransmission. They rely on the fact that a device can apply a different channel selection strategy while being in a back-off state. Hence, a natural question is whether using this additional contextual information can improve the performance of a learning policy.
To that end, all of our heuristics comprise two stages: the first stage is a UCB algorithm employed for the first attempt to transmit, and the second stage is another algorithm used for channel selection in the next retransmissions.
We present below four heuristics for this second stage (short names in “quotes” correspond to the legend on Figures 3, 4).
V-A Uniform random retransmission (“Random”)
In this first proposal, the device uses a random channel selection, following a uniform distribution over the available channels. It is described below in Algorithm 2.
V-B UCB for retransmission (“Only UCB”)
Instead of applying a random channel selection, another heuristic is to use a second UCB algorithm in the second stage. In other words, we expect that this algorithm is able to learn the best channel to retransmit a packet. It is described in Algorithm 3, and it is still a practical approach, since the storage requirements and time complexity remain linear in the number of channels.
Note that the variables related to the UCB algorithm employed for the retransmission are distinguished by a dedicated superscript.
V-C K different UCBs for retransmission (“K UCB”)
Another heuristic is not to use the same algorithm regardless of where the collision occurred, but to use one UCB algorithm per channel, i.e., K different algorithms for K channels. This means that, after a failed first transmission in a given channel, the device relies on the UCB algorithm associated with that channel to decide its retransmission. The corresponding procedure is given in Algorithm 4. Each of these algorithms is denoted using a superscript indexed by the channel of the first transmission.
Although this approach increases the complexity and storage requirements (which become quadratic in the number of channels), for our LPWA networks of interest, such as LoRaWAN, the cost of its implementation is still affordable, since a small number of channels is used. For instance, with only a few channels, the memory needed to store these algorithms remains of the same order as that needed to store a single one.
V-D Delayed UCB for retransmission (“Delayed UCB”)
This last heuristic is a composite of the random retransmission (Algorithm 2) and the UCB retransmission (Algorithm 3) approaches. Instead of starting the second stage directly from the first retransmission, we introduce a fixed delay, and start to rely on the second stage only after that number of transmissions. Until then, the selection is handled with the random retransmission.
The idea behind this delay is to allow the first stage to start learning the best channel, before starting the second stage (see details in Algorithm 5). The number of transmissions to wait before applying the second algorithm has to be fixed beforehand.
Note that the variables related to the delayed second-stage algorithm are distinguished by a dedicated superscript.
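The four second-stage heuristics of this section can be summarized by a single two-stage selector. The sketch below is a hypothetical illustration and not a transcription of Algorithms 2 to 5: the class and method names, the mode labels (`"random"`, `"ucb"`, `"k_ucb"`, `"delayed"`), and the default `alpha=0.5` are assumptions, and so is the choice to keep updating the second-stage learner with retransmission feedback during the delay of the delayed variant.

```python
import math
import random


class UCB:
    """Compact UCB index policy (same form as in Section IV)."""

    def __init__(self, n_channels, alpha=0.5):
        self.alpha, self.t = alpha, 0
        self.counts = [0] * n_channels
        self.sums = [0.0] * n_channels

    def select(self):
        def index(k):
            if self.counts[k] == 0:
                return float("inf")
            bonus = math.sqrt(self.alpha * math.log(self.t) / self.counts[k])
            return self.sums[k] / self.counts[k] + bonus
        return max(range(len(self.counts)), key=index)

    def update(self, k, reward):
        self.t += 1
        self.counts[k] += 1
        self.sums[k] += reward


class TwoStageSelector:
    """First stage: one UCB. Second stage: selected by `mode` (hypothetical labels)."""

    def __init__(self, n_channels, mode, delay=0, seed=0):
        self.n, self.mode, self.delay = n_channels, mode, delay
        self.rng = random.Random(seed)
        self.first = UCB(n_channels)
        self.n_first = 0  # number of first transmissions so far
        if mode in ("ucb", "delayed"):
            self.second = [UCB(n_channels)]                             # one learner
        elif mode == "k_ucb":
            self.second = [UCB(n_channels) for _ in range(n_channels)]  # one per channel
        else:
            self.second = []                                            # "random"

    def _learner(self, first_channel):
        # In "k_ucb" mode, first_channel must be the channel of the failed first attempt.
        return self.second[first_channel if self.mode == "k_ucb" else 0]

    def _use_random(self):
        waiting = self.mode == "delayed" and self.n_first <= self.delay
        return self.mode == "random" or waiting

    def choose_first(self):
        self.n_first += 1
        return self.first.select()

    def choose_retx(self, first_channel):
        if self._use_random():
            return self.rng.randrange(self.n)
        return self._learner(first_channel).select()

    def feedback(self, channel, ack, is_retx, first_channel=None):
        reward = 1.0 if ack else 0.0
        if not is_retx:
            self.first.update(channel, reward)
        elif self.second:
            self._learner(first_channel).update(channel, reward)
```

A device would call `choose_first()` for a new packet, `choose_retx(first_channel)` after each collision, and `feedback(...)` once the presence or absence of an Ack is known. The `"k_ucb"` mode makes the quadratic storage cost explicit, since it holds one full learner per channel.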
VI Simulations to compare our heuristics
We simulate our network considering devices following the contention Markov process described in Section II, and a LoRa standard with channels. Each device is set to transmit with a fixed probability , i.e., a packet about every minutes for time slots of .
For the evaluation of the proposed heuristics, a total number of time slots is considered, and the results are averaged over independent random simulations.
In a first scenario, we consider a total number of IoT devices, with a non-uniform repartition of static devices given by for the four channels. In other words, the channels are occupied , , , and of time, and the contention Markov process considered is given by , and . In Fig. 3, we show the successful transmission rate versus the number of slots, for all the proposed heuristics.
A first result is that all the heuristics clearly outperform the non-learning approach that simply uses random channel selection for both transmissions and retransmissions (i.e., the no UCB curve). The improvement of the heuristics over the non-learning approach is evident: for every heuristic that uses some kind of learning mechanism, the successful transmission rate increases rapidly (or, equivalently, the PLR decreases). Moreover, all of these approaches show a fast convergence, making them suitable for the targeted application. It is also worth mentioning that using the same algorithm for retransmissions, denoted here as “Only UCB”, achieves the best performance, while a “Random” retransmission features a slight degradation. This result can be explained as follows: the loss of performance related to the separation of information between several algorithms is greater than the gain obtained by considering the first transmissions and retransmissions separately.
We also consider in our analysis a second scenario, with its own ALOHA protocol parameters, distribution of the static devices over the four channels, and number of IoT devices. The corresponding results are depicted in Fig. 4. In this case, the successful transmission rate is degraded compared with the results achieved in Fig. 3, which can be explained by the fact that the network contains more devices, which increases the collision probability. It is important to highlight that the “Random” retransmission heuristic shows a poor performance in comparison to the other heuristics. This can be attributed to the fact that the number of retransmissions is increased, and consequently a learning approach is able to take advantage of it. Furthermore, the “UCB”, “K UCB” and “Delayed UCB” heuristics behave similarly to “Only UCB”, after a similar convergence time.
The conclusions we can draw from the depicted results are twofold. First, MAB learning algorithms are very useful to reduce the collision rate in LPWA networks: a clear gain in successful transmission rate is observed after convergence. Second, using learning mechanisms for retransmissions can be an interesting way to reduce collisions in networks with massive deployments of IoT devices. This can be checked in Fig. 4, where the random retransmission heuristic compares unfavourably with the UCB-based approaches that use learning for channel selection during the retransmission procedure.
VII Conclusions
In this paper, we presented a retransmission model of LPWA networks based on an ALOHA protocol, slotted both in time and frequency, in which dynamic IoT devices can use machine learning algorithms to improve their PLR when accessing the network. The main novelty of this model is to address the packet retransmissions upon radio collision, by using a Multi-Armed Bandit framework. We presented and evaluated several learning heuristics that try to learn how to transmit and retransmit in a smarter way, by using the UCB algorithm for channel selection for the first transmission, and different proposals based on UCB for the retransmissions upon collisions.
We showed that incorporating learning for the transmission is needed to achieve optimal performance, with a significant gain in terms of successful transmission rate in networks with a large number of devices. Our empirical simulations show that each of our proposed heuristics outperforms a naive random access scheme. Surprisingly, the main take-away message is that a simple learning approach, which retransmits in the same channel, turns out to perform as well as more complicated heuristics.
Future works
The utility and impact of the proposed approaches for LPWA networks motivate us to address several subjects as future works. Among them is the non-stationarity of the channel occupancy caused by the learning policy employed by the IoT devices. To that end, modifications of MAB algorithms have been proposed, such as Sliding-Window-UCB or Discounted-UCB [21] or, more recently, M-UCB [22], which nevertheless have not been explored for the targeted problem.
In order to validate our results in a realistic experimental setting and not only with simulations, future works include a hardware implementation of the analyzed models to complete our recent works [23, 24]. A hardware demonstrator could also be beneficial for studying other settings by removing some hypotheses, for instance by studying a similar model in non-slotted time.
Note on the simulation code
The source code (MATLAB or Octave) used for the simulations and the figures is open-sourced under the MIT License, at Bitbucket.org/scee_ietr/ucb_smart_retrans.
References
- [1] U. Raza, P. Kulkarni, and M. Sooriyabandara, “Low power wide area networks: An overview,” IEEE Communications Surveys & Tutorials, vol. 19, no. 2, pp. 855–873, 2017.
- [2] J. Mitola and G. Q. Maguire, “Cognitive Radio: making software radios more personal,” IEEE Personal Communications, vol. 6, pp. 13–18, Aug 1999.
- [3] S. Haykin, “Cognitive Radio: Brain-Empowered Wireless Communications,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 2, pp. 201–220, 2005.
- [4] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time Analysis of the Multi-armed Bandit Problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002.
- [5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The Non-Stochastic Multi-Armed Bandit Problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
- [6] S. Bubeck, N. Cesa-Bianchi, et al., “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
- [7] R. Bonnefoi, C. Moy, and J. Palicot, “Improvement of the LPWAN AMI backhaul’s latency thanks to reinforcement learning algorithms,” EURASIP Journal on Wireless Communications and Networking, vol. 2018, no. 1, p. 34, 2018.
- [8] A. Azari and C. Cavdar, “Self-organized Low-power IoT Networks: A Distributed Learning Approach,” in IEEE Globecom, (Abu Dhabi, UAE), Dec 2018.
