DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks

Xiaoliang Chen; Baojia Li; Roberto Proietti; Hongbo Lu; Zuqing Zhu; S. J. Ben Yoo

arXiv:1905.02248·cs.NI·September 4, 2019


TL;DR

DeepRMSA introduces a deep reinforcement learning framework for routing, modulation, and spectrum assignment in elastic optical networks, improving efficiency and stability through novel training mechanisms.

Contribution

It develops a new deep RL-based RMSA policy learning method with episode-based and window-based training mechanisms for EONs.

Findings

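1. Reduces request blocking probability by 20.3% (NSFNET) and 14.3% (COST239) relative to the KSP-FF baseline.

2. Stabilizes training with the proposed window-based mechanism, DeepRMSA-FLX.

3. DeepRMSA-FLX outperforms the SP-FF and KSP-FF heuristic baselines in blocking probability.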


Equations

$$n = \left\lceil \frac{b}{m \cdot C_{grid}^{BPSK}} \right\rceil, \tag{1}$$

$$\Gamma_{t} = \sum_{t^{\prime} \in [t, \infty)} \gamma^{t^{\prime} - t} \cdot r_{t^{\prime}}, \tag{2}$$

$$\Gamma_{t^{\prime}} = \sum_{i \in [0, N-1],\ \chi_{t^{\prime}+i} \in \Lambda} \gamma^{i} \cdot r_{t^{\prime}+i}. \tag{3}$$

$$\delta_{t^{\prime}} = \Gamma_{t^{\prime}} - f_{\theta_{v}}(s_{t^{\prime}}), \tag{4}$$

$$L_{\theta_{p}} = \ldots - \frac{\alpha}{N} \sum_{\chi_{t^{\prime}} \in \Lambda} \sum_{a \in A} f_{\theta_{p}}(s_{t^{\prime}}, a) \log f_{\theta_{p}}(s_{t^{\prime}}, a), \tag{5}$$

$$L_{\theta_{v}} = \frac{1}{N} \sum_{\chi_{t^{\prime}} \in \Lambda} \left(f_{\theta_{v}}(s_{t^{\prime}}) - \Gamma_{t^{\prime}}\right)^{2}. \tag{6}$$


Full text

DeepRMSA: A Deep Reinforcement Learning Framework for Routing, Modulation and Spectrum Assignment in Elastic Optical Networks

Xiaoliang Chen, Baojia Li, Roberto Proietti, Hongbo Lu, Zuqing Zhu, S. J. Ben Yoo

X. Chen, R. Proietti, H. Lu and S. J. B. Yoo are with the Department of Electrical and Computer Engineering, University of California, Davis, Davis, CA 95616, USA. B. Li and Z. Zhu are with the School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China. Manuscript received Dec. 8, 2018.

Abstract

This paper proposes DeepRMSA, a deep reinforcement learning framework for routing, modulation and spectrum assignment (RMSA) in elastic optical networks (EONs). DeepRMSA learns the correct online RMSA policies by parameterizing the policies with deep neural networks (DNNs) that can sense complex EON states. The DNNs are trained with experiences of dynamic lightpath provisioning. We first modify the asynchronous advantage actor-critic algorithm and present an episode-based training mechanism for DeepRMSA, namely, DeepRMSA-EP. DeepRMSA-EP divides the dynamic provisioning process into multiple episodes (each containing the servicing of a fixed number of lightpath requests) and performs training by the end of each episode. The optimization target of DeepRMSA-EP at each step of servicing a request is to maximize the cumulative reward within the rest of the episode. Thus, we obviate the need for estimating the rewards related to unknown future states. To overcome the instability issue in the training of DeepRMSA-EP due to the oscillations of cumulative rewards, we further propose a window-based flexible training mechanism, i.e., DeepRMSA-FLX. DeepRMSA-FLX attempts to smooth out the oscillations by defining the optimization scope at each step as a sliding window, and ensuring that the cumulative rewards always include rewards from a fixed number of requests. Evaluations with the two sample topologies show that DeepRMSA-FLX can effectively stabilize the training while achieving blocking probability reductions of more than $20.3\%$ and $14.3\%$ , when compared with the baselines.

Index Terms:

Elastic optical networks (EONs), Routing, modulation and spectrum assignment (RMSA), Deep reinforcement learning, Asynchronous advantage actor-critic algorithm.

I Introduction

The explosive growth of emerging applications (e.g., cloud computing) and the popular adoption of new networking paradigms (e.g., the Internet of Things) are demanding a new network infrastructure that can support dynamic, high-capacity and quality-of-transmission (QoT)-guaranteed end-to-end services. Recently, elastic optical networking (EON) has emerged as one of the most promising networking technologies for the next-generation backbone networks [1]. Compared with the traditional fixed-grid (e.g., $50$ GHz) wavelength-division multiplexing (WDM) scheme, EON can flexibly set up bandwidth-variable superchannels by grooming series of finer-granularity (e.g., $6.25$ GHz) subcarriers and adapting the modulation formats according to the QoT of lightpaths [2].

The flexible resource allocation mechanisms in EON, on the other hand, make the corresponding service provisioning designs more complicated. To fully exploit the benefits of such flexibilities and realize cost-effective EON, previous studies have intensively investigated the routing, modulation and spectrum assignment (RMSA) problem for EON [3]. The authors of [4, 5, 6] first proposed integer linear programming (ILP) models for solving the static RMSA problems, where all the lightpath requests are assumed to be known a priori. While the ILP models can provide the optimal solutions, the RMSA problems have been proved to be $\mathcal{NP}$-hard [4], so the models are intractable for large-scale instances. In this context, a number of heuristic or approximation algorithms have been developed. In [4], Wang et al. proposed two algorithms, namely, balanced load spectrum allocation and shortest path with maximum spectrum reuse, to minimize the maximum required spectrum resources in an EON for a given traffic demand. The authors of [5] presented a simulated annealing approach for determining the servicing order of lightpath requests and then applied the k-shortest path routing and first-fit (KSP-FF) scheme to calculate the RMSA solution for each request. In [7, 6], the authors leveraged genetic algorithms to realize joint RMSA optimizations. A conflict-graph-based two-phase algorithm with a proven performance level was proposed in [8]. For more heuristic RMSA designs, such as random-fit, exact-fit and most-used spectrum assignment, readers can refer to [3].

Unlike static RMSA problems for which explicit optimization models can be formulated, optimizing dynamic lightpath provisioning in EONs (i.e., dynamic RMSA problems) is more challenging. The dynamic arrivals and departures of lightpath requests as well as the uncertainty of future traffic could dramatically destabilize the EON state and thus deteriorate the efficiency of the optimizations based on the current state. To cope with such dynamics, a few dynamic RMSA designs have been reported lately, in addition to those that can be derived from the aforementioned static RMSA algorithms. The authors in [9] applied the multi-path routing scheme and developed several empirical weighting methods taking into account path lengths, link spectrum utilization, and other features to realize state-aware dynamic RMSA. In [10], Yin et al. investigated the spectrum fragmentation effect in dynamic lightpath provisioning and proposed a fragmentation-aware RMSA algorithm to mitigate spectrum fragmentation. More aggressive service reconfiguration approaches, e.g., spectrum defragmentation [11, 12], have also been proposed as complements to normal RMSA algorithms to enable periodical service consolidations, but at the expense of high operational costs. However, the existing works either apply fixed RMSA policies regardless of the time-varying EON states or rely on simple empirical policies based on manually extracted features. They therefore lack a comprehensive perception of the holistic EON state and are unable to achieve truly adaptive service provisioning in EONs.

In the meantime, recent advances in deep reinforcement learning (DRL) have demonstrated beyond-human-level performance in handling large-scale online control tasks [13, 14]. By parameterizing policies with deep neural networks (DNNs) [15], DRL enables learning agents to perceive complex system states from high-dimensional input data (e.g., screenshots and traffic matrices) and progressively learn correct policies through experiences of repeated interactions with the target systems. The application of DRL in the communication and networking domain has attracted intensive research interest during the past two years [16, 17, 18]. In [17], the authors enhanced the general deep Q-learning framework in [13] with novel exploration and experience replay techniques to solve the traffic engineering problem. The authors of [18] presented a DRL-based framework for datacenter network management and demonstrated a DRL agent that can learn the optimal topology configurations with respect to different application profiles. Nevertheless, the application of DRL in optical networking, and in particular for addressing the RMSA problem, has not been investigated.

In this paper, we propose DeepRMSA, a DRL-based RMSA framework for learning the optimal online RMSA policies in EONs. The contributions of this paper can be summarized as follows. 1) We propose, for the first time, a DRL framework for optical network management and resource allocation, i.e., RMSA. 2) We propose two training mechanisms for DeepRMSA, taking into account the unique characteristics of the RMSA problem. 3) Numerical results verify the superiority of DeepRMSA over the state-of-the-art heuristic algorithms.

The rest of the paper is organized as follows. Section II presents the RMSA problem formulation. Section III details the DeepRMSA framework. In Section IV, we elaborate on the design of DeepRMSA, including the modeling and the training mechanisms. Then, in Section V, we show the performance evaluations and related discussions. Finally, Section VI concludes the paper.

II Problem Formulation

Let $G(V,E,F)$ denote an EON topology, where $V$ and $E$ represent the sets of nodes and fiber links, $F=\left\{F_{e,f}\mid_{e,f}\right\}$ contains the state of each frequency slot (FS) $f\in[1,f_{0}]$ on each fiber link $e\in E$ . We model a lightpath request from node $o$ to $d$ ( $o,d\in V$ ) as $\mathcal{R}_{t}(o,d,b,\tau)$ , with $b$ Gb/s and $\tau$ denoting the bandwidth requirement and service duration, respectively. To provision $\mathcal{R}_{t}$ , we need to compute an end-to-end routing path $\mathcal{P}_{o,d}$ , determine a proper modulation format $m$ to use for QoT assurance, and allocate a number of spectrally contiguous FS’s (i.e., the spectrum contiguous constraint) on each link along $\mathcal{P}_{o,d}$ according to $b$ and $m$ . In this work, we assume that the EON is not equipped with the spectrum conversion capability. Therefore, the spectra allocated on different fibers to $\mathcal{R}_{t}$ must align (i.e., the spectrum continuous constraint). We adopt the impairment-aware model in [19] to decide the modulation format according to the physical distance of $\mathcal{P}_{o,d}$ . Specifically, the number of FS’s needed can be computed as,

$$n = \left\lceil \frac{b}{m \cdot C_{grid}^{BPSK}} \right\rceil, \tag{1}$$

where $C_{grid}^{BPSK}$ is the data rate an FS of BPSK signal can support and $m\in\left\{1,2,3,4\right\}$ corresponds to BPSK, QPSK, 8-QAM and 16-QAM, respectively. The static RMSA problem (i.e., offline network planning) gives a set of permanent lightpath requests $\mathcal{R}=\left\{\mathcal{R}_{t}\mid_{t}\right\}$ ($\tau\rightarrow\infty$) and requires provisioning all of them in a batch following the link capacity constraint [4]. The objective of the static RMSA problem is to minimize the total spectrum usage. Unlike the static problem where requests are known a priori, in the dynamic RMSA problem (i.e., online lightpath provisioning) considered in this work, lightpath requests arrive and expire on-the-fly and need to be serviced immediately upon their arrivals. The dynamic RMSA problem aims at minimizing the long-term request blocking probability, which is defined as the ratio of the number of blocked requests to the total number of requests over a period.
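As a minimal sketch of Eq. 1 and of the blocking-probability definition above, the two quantities can be computed as follows. The function names are hypothetical, and the default value of `c_grid_bpsk` (12.5 Gb/s per FS) is only an assumed placeholder, since this section does not state $C_{grid}^{BPSK}$ numerically.

```python
import math


def required_slots(bandwidth_gbps: float, m: int, c_grid_bpsk: float = 12.5) -> int:
    """Number of frequency slots per Eq. 1: n = ceil(b / (m * C_grid^BPSK)).

    c_grid_bpsk is an assumed placeholder value, not taken from the text.
    """
    if m not in (1, 2, 3, 4):
        raise ValueError("m must be 1 (BPSK), 2 (QPSK), 3 (8-QAM) or 4 (16-QAM)")
    return math.ceil(bandwidth_gbps / (m * c_grid_bpsk))


def blocking_probability(num_blocked: int, num_total: int) -> float:
    """Ratio of blocked requests to all requests observed over a period."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_blocked / num_total
```

Under the assumed slot capacity, a 100 Gb/s request would need 8 FS's with BPSK and 2 FS's with 16-QAM, which illustrates why shorter paths that admit higher-order modulation consume less spectrum.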

III DeepRMSA Framework

Fig. 1 shows the schematic of DeepRMSA. DeepRMSA takes advantage of the software-defined networking (SDN) paradigm for centralized and automated control and management of the EON data plane [20]. Specifically, a remote SDN controller interacts with the local SDN agents to collect network states and lightpath requests, and distribute RMSA schemes, while the SDN agents drive the actual device configurations according to the received commands. The operation principle of DeepRMSA is designed based on the framework of DRL. Upon receiving a lightpath request $\mathcal{R}_{t}$ (step 1), the SDN controller retrieves from the traffic engineering database key network state representations, including the in-service lightpaths, resource utilization and topology abstraction, and invokes the feature engineering module to generate tailored state data $s_{t}$ for DeepRMSA (step 2). The DNNs of DeepRMSA read the state data and output an RMSA policy $\pi_{t}$ for the SDN controller (step 3). The controller in turn takes an action $a_{t}$ (i.e., determining an RMSA scheme) based on $\pi_{t}$ and attempts to set up the corresponding lightpath (step 4). The reward system receives the outcome related to the previous RMSA operations as feedback (step 5) and produces an immediate reward $r_{t}$ for DeepRMSA. $r_{t}$ , together with $s_{t}$ and $a_{t}$ , are stored in an experience buffer (step 6), from which DeepRMSA derives training signals for updating the DNNs afterward (step 7). The objective of DeepRMSA upon servicing $\mathcal{R}_{t}$ is to maximize the long-term cumulative reward defined as,

$$\Gamma_{t} = \sum_{t^{\prime} \in [t, \infty)} \gamma^{t^{\prime} - t} \cdot r_{t^{\prime}}, \tag{2}$$

where $\gamma\in[0,1]$ is the discount factor that decays future rewards. Eventually, DeepRMSA enables a self-learning capability that can learn and adapt RMSA policies through dynamic lightpath provisioning. Note that, by deploying multiple parallel DRL agents, each for a particular application or functionality (e.g., protection [20] and defragmentation [12]), we can extend DeepRMSA to build an intact autonomic EON system.

IV DeepRMSA Design

In this section, we first present the modeling of DeepRMSA, including the definitions of state representation, action space, and reward. Then, we take into account the unique characteristics of dynamic lightpath provisioning and develop two training mechanisms for DeepRMSA.

IV-A Modeling

1) State: The state representation $s_{t}$ for DeepRMSA is a $1\times(2|V|+1+(2J+3)K)$ array containing the information of $\mathcal{R}_{t}$ and the spectrum utilization on $K$ candidate paths for $\mathcal{R}_{t}$. We use $2|V|+1$ elements of $s_{t}$ to convey $o$, $d$ (in the one-hot format), and $\tau$, where $|V|$ represents the number of nodes in $V$. For each of the $K$ paths, we calculate the sizes and the starting indices of the first $J$ available FS-blocks ($2J$ elements), the required number of FS's based on the applicable modulation format, the average size of the available FS-blocks, and the total number of available FS's ($3$ elements). In this way, we aim to extract key features on different candidate paths, from which DeepRMSA can sense the global EON state. Note that a more comprehensive design could include the original two-dimensional spectrum state $F$ in $s_{t}$ directly to avoid any information loss. However, this would dramatically increase the scale of $s_{t}$ (i.e., requiring $f_{0}\cdot|E|$ elements simply for conveying $F$) and cause scalability issues. Moreover, making DeepRMSA extract useful features from the large-scale binary matrix while also incorporating the topology connectivity and the spectrum continuous and contiguous constraints in EON is not trivial. We will keep this as one of our future research tasks.
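The state layout described above can be sketched as follows. This is an illustration under several assumptions that the text does not fix: each path is summarized by a hypothetical availability vector `avail` (a slot counts as free only if it is free on every link of the path), an "available FS-block" is taken to be any maximal run of free slots, missing blocks are padded with `-1`, and the ordering of features within each path is arbitrary.

```python
from typing import List, Sequence, Tuple


def free_blocks(avail: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (start_index, size) for every maximal run of free slots."""
    blocks, start = [], None
    for i, free in enumerate(avail):
        if free and start is None:
            start = i
        elif not free and start is not None:
            blocks.append((start, i - start))
            start = None
    if start is not None:
        blocks.append((start, len(avail) - start))
    return blocks


def path_features(avail: Sequence[bool], n_required: int, J: int,
                  pad: float = -1.0) -> List[float]:
    """Build the (2J + 3) features of one candidate path."""
    blocks = free_blocks(avail)
    feats: List[float] = []
    for j in range(J):
        if j < len(blocks):
            start, size = blocks[j]
            feats += [float(size), float(start)]
        else:
            feats += [pad, pad]  # assumed padding when fewer than J blocks exist
    total = sum(size for _, size in blocks)
    average = total / len(blocks) if blocks else 0.0
    feats += [float(n_required), average, float(total)]
    return feats


def build_state(num_nodes: int, o: int, d: int, tau: float,
                per_path: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate one-hot source, one-hot destination, tau and path features."""
    state = [0.0] * (2 * num_nodes)
    state[o] = 1.0
    state[num_nodes + d] = 1.0
    state.append(float(tau))
    for feats in per_path:
        state.extend(feats)
    return state
```

With $|V|$ nodes, $K$ paths and $J$ blocks per path, the resulting list has $2|V|+1+(2J+3)K$ entries, matching the dimension given in the text.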

2) Action: DeepRMSA selects for each $\mathcal{R}_{t}$ a routing path from the $K$ candidates and one of the $J$ FS-blocks on the selected path. Therefore, the action space (denoted as $A$ ) includes $K\cdot J$ actions.

3) Reward: DeepRMSA receives an immediate reward $r_{t}$ of $1$ if $\mathcal{R}_{t}$ is successfully serviced. Otherwise, $r_{t}=-1$ .

4) DNNs: DeepRMSA employs a policy DNN $f_{\theta_{p}}(s_{t})$ for generating the RMSA policy (i.e., the probability distribution over the action space) and a value DNN $f_{\theta_{v}}(s_{t})$ for estimating the value of $s_{t}$ (i.e., the discounted cumulative reward since $s_{t}$ ), where $\theta_{p}$ and $\theta_{v}$ are the sets of parameters of the DNNs. $f_{\theta_{p}}(s_{t})$ and $f_{\theta_{v}}(s_{t})$ share the same fully-connected DNN architecture [15] except for the output layers. The output layer of $f_{\theta_{p}}(s_{t})$ consists of $K\cdot J$ neurons, while $f_{\theta_{v}}(s_{t})$ has only one output neuron.
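To make the action and reward definitions concrete, the sketch below samples an action from a policy vector and maps it back to a path and FS-block. It assumes that the $K\cdot J$ outputs are laid out path-major (index $k\cdot J+j$), which the text does not specify, and it uses roulette-wheel sampling, the selection strategy named in Section IV-B. All function names are hypothetical.

```python
import random
from typing import Sequence, Tuple


def roulette_select(policy: Sequence[float], rng: random.Random) -> int:
    """Sample an action index with probability proportional to policy[action]."""
    u = rng.random() * sum(policy)
    cumulative = 0.0
    for action, p in enumerate(policy):
        cumulative += p
        if u < cumulative:
            return action
    return len(policy) - 1  # guard against floating-point round-off


def decode_action(action: int, J: int) -> Tuple[int, int]:
    """Map a flat action index to (path index k, FS-block index j).

    Assumes a path-major layout: action = k * J + j.
    """
    return divmod(action, J)


def reward(serviced: bool) -> int:
    """Immediate reward: +1 if the request is serviced, -1 if it is blocked."""
    return 1 if serviced else -1
```

With the evaluation setting $J=1$, `decode_action` reduces to choosing one of the $K$ candidate paths.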

IV-B Training

We designed the training of DeepRMSA based on the framework of the A3C algorithm [14]. Basically, A3C makes use of multiple parallel actor-learners (child threads of a DRL agent), each interacting with its own copy of the system environment, to achieve learning with more abundant and diversified samples. The actor-learners maintain a set of global DNN parameters $\theta^{*}_{p}$ and $\theta^{*}_{v}$ asynchronously.

Different from general DRL tasks that can be modeled as Markov decision processes (i.e., the state transition from $s_{t}$ to $s_{t+1}$ follows a probability distribution given by $P(s_{t+1}|s_{t},a_{t})$), DeepRMSA involves state transitions that are difficult to model. In particular, because $\mathcal{R}_{t+1}$ can be random, there can be infinitely many possible states for $s_{t+1}$ in DeepRMSA. Thus, we first slightly modified the standard A3C algorithm by defining an episode as the servicing of $N$ lightpath requests, and by making $N$ equal to the training batch size. Here, an episode defines the optimization scope of a DRL task. This way, we eliminate the need for estimating the value of $s_{t+1}$. We denote DeepRMSA with the episode-based training mechanism as DeepRMSA-EP. Algorithm 1 summarizes the procedures of an actor-learner thread in DeepRMSA-EP. In line 1, the actor-learner initiates an empty experience buffer $\Lambda$. Then, for each $\mathcal{R}_{t}$, the algorithm checks whether $\Lambda$ is empty (i.e., a new episode starts), and if true, synchronizes the local DNNs with the sets of global parameters (lines 3-5). Line 6 updates the EON state by releasing the resources allocated to lightpaths that expire. In line 7, we obtain $s_{t}$ based on the model discussed in Section IV-A. In line 8, we invoke the policy and value DNNs to generate an RMSA policy and a value estimation for $s_{t}$. Note that, in DeepRMSA-EP, we make $s_{t}$ include one more element to indicate the position of $\mathcal{R}_{t}$ within the current episode. For instance, if $\mathcal{R}_{t}$ is the $i$-th request of the episode, we calculate a position indicator as $(N-i+1)/N$. The algorithm decides an RMSA scheme based on the generated policy (lines 9-10, i.e., with the Roulette strategy) and receives a reward accordingly (line 11). The RMSA sample is then stored in the buffer (line 12). With lines 13-21, DeepRMSA-EP performs training every time the buffer contains $N$ samples. Specifically, in the for-loop of lines 14-16, the algorithm first calculates for each sample $\chi_{t^{\prime}}$ in the buffer the discounted cumulative reward (starting from $\mathcal{R}_{t^{\prime}}$ till the end of the episode) as,

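$$\Gamma_{t^{\prime}} = \sum_{i \in [0, N-1],\ \chi_{t^{\prime}+i} \in \Lambda} \gamma^{i} \cdot r_{t^{\prime}+i}. \tag{3}$$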

Then, the advantage of each action being taken can be obtained by,

$$\delta_{t^{\prime}} = \Gamma_{t^{\prime}} - f_{\theta_{v}}(s_{t^{\prime}}), \tag{4}$$

which indicates how much better an action turns out to be than estimated. Lines 17-18 calculate the policy and value losses $L_{\theta_{p}}$ and $L_{\theta_{v}}$, from which policy and value gradients can be derived. In particular, $L_{\theta_{p}}$ is defined as,

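$$L_{\theta_{p}} = \ldots - \frac{\alpha}{N} \sum_{\chi_{t^{\prime}} \in \Lambda} \sum_{a \in A} f_{\theta_{p}}(s_{t^{\prime}}, a) \log f_{\theta_{p}}(s_{t^{\prime}}, a), \tag{5}$$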

where $\alpha$ ($0<\alpha\ll 1$) is a weighting coefficient. The rationale behind Eq. 5 is to reinforce actions (i.e., improving the probabilities) with larger advantages while encouraging exploration (by introducing the total entropy of the policies as a secondary penalty term). The value loss is simply the mean squared error of the value estimations, i.e.,

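$$L_{\theta_{v}} = \frac{1}{N} \sum_{\chi_{t^{\prime}} \in \Lambda} \left(f_{\theta_{v}}(s_{t^{\prime}}) - \Gamma_{t^{\prime}}\right)^{2}. \tag{6}$$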

Finally, in lines 19-20, the actor-learner applies the gradients to tune the global DNN parameters with training algorithms such as RMSProp or Adam [21], and empties the buffer to get prepared for the next episode.
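The training signals of DeepRMSA-EP can be illustrated with a few lines of plain Python. The sketch covers the episode-based cumulative rewards (Eq. 3), the advantages (Eq. 4), the value loss (Eq. 6), and the entropy sum that appears in Eq. 5; it does not attempt to reproduce the full policy loss or the gradient step. The function names are hypothetical, and the buffer is represented simply as lists of rewards, value estimates and policy vectors.

```python
import math
from typing import List, Sequence


def episode_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Eq. 3 for DeepRMSA-EP: discounted reward sum from each sample to the buffer end."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def advantages(returns: Sequence[float], values: Sequence[float]) -> List[float]:
    """Eq. 4: cumulative reward minus the value DNN's estimate."""
    return [g - v for g, v in zip(returns, values)]


def value_loss(returns: Sequence[float], values: Sequence[float]) -> float:
    """Eq. 6: mean squared error between value estimates and cumulative rewards."""
    return sum((v - g) ** 2 for g, v in zip(returns, values)) / len(returns)


def entropy_sum_term(policies: Sequence[Sequence[float]], alpha: float) -> float:
    """The sum in Eq. 5: -(alpha / N) * sum over samples and actions of p * log p."""
    n = len(policies)
    plogp = sum(p * math.log(p) for policy in policies for p in policy if p > 0.0)
    return -alpha / n * plogp
```

Note how the last sample in the buffer receives a cumulative reward equal to its own immediate reward only, which is the imbalance that motivates the window-based mechanism.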

Note that the uncertainty of dynamic lightpath requests can result in unpredictable trajectories of $s_{t}$, which in turn can cause oscillations of the cumulative rewards and destabilize the training process. This problem becomes especially severe when the numbers of requests involved are small. Recall the calculation of cumulative rewards in Eq. 3: $\Gamma_{t^{\prime}}$ accumulates fewer rewards as $\chi_{t^{\prime}}$ gets closer to the end of the buffer and eventually contains the reward from only one request. To cope with this issue, we propose a window-based flexible training mechanism for DeepRMSA, namely DeepRMSA-FLX. Basically, DeepRMSA-FLX invokes the training process each time the buffer contains $2N-1$ samples. DeepRMSA-FLX slides a window of length $N$ through the buffer and calculates the cumulative reward for each of the first $N$ samples, still with Eq. 3. Thus, every cumulative reward involves the rewards from servicing $N$ requests. By doing so, we aim to smooth out the oscillations equally for all the samples (if $N$ is sufficiently large; note that we typically set $N$ to moderate values, e.g., $50$, so that training signals can be applied to the DNNs quickly). Then, the algorithm calculates the policy and value losses with these $N$ samples and updates the global DNN parameters accordingly. The $N$ samples are removed from the buffer afterward. Meanwhile, the condition for synchronizing local DNNs (line 3 of Algorithm 1) becomes $|\Lambda|$ being equal to $N-1$ in DeepRMSA-FLX.
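A minimal sketch of the window-based calculation is given below, assuming the buffer is reduced to a list of immediate rewards. The names `window_returns` and `consume_trained` are hypothetical; the sketch only illustrates how each of the first $N$ samples accumulates exactly $N$ rewards and how $N-1$ samples remain afterward.

```python
from typing import List, Sequence


def window_returns(rewards: Sequence[float], n: int, gamma: float) -> List[float]:
    """Cumulative rewards for the first n samples of a buffer holding 2n - 1 rewards.

    Each value sums exactly n discounted rewards (Eq. 3 applied inside a window).
    """
    if len(rewards) != 2 * n - 1:
        raise ValueError("DeepRMSA-FLX trains when the buffer holds 2N - 1 samples")
    return [
        sum(gamma ** i * rewards[t + i] for i in range(n))
        for t in range(n)
    ]


def consume_trained(buffer: Sequence[float], n: int) -> List[float]:
    """Drop the n samples just used for training; n - 1 samples remain."""
    return list(buffer[n:])
```

The remaining $N-1$ samples correspond to the synchronization condition $|\Lambda|=N-1$ stated above: the next training step is triggered after $N$ further requests have been serviced.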

V Evaluation

V-A Simulation Setup

We evaluated the performance of DeepRMSA with numerical simulations. We first used the 14-node NSFNET topology in Fig. 2 and assumed that each fiber link could accommodate $100$ FS’s. The dynamic lightpath requests were generated according to a Poisson process following a uniform traffic distribution, with the average arrival rate and service duration being $10$ and $15$ time units, respectively. The bandwidth requirement of each request is evenly distributed between $25$ and $100$ Gb/s. The DNNs used ELU as the activation function for the hidden layers. We set $K=5$ and $J=1$ . Hence, DeepRMSA selected only the routing paths and applied the first-fit scheme for spectrum allocation. $\gamma$ , $\alpha$ , $N$ and the learning rate were set as $0.95$ , $0.01$ , $50$ and $10^{-5}$ , respectively. We used the Adam algorithm [21] for training. Note that, we normalized every field of $s_{t}$ before feeding it to the DNNs.
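The traffic model can be sketched as a simple request generator. Several details here are assumptions rather than statements from the text: the value $10$ is read as the mean number of arrivals per time unit, service durations are drawn from an exponential distribution with mean $15$, and source-destination pairs are drawn uniformly among distinct nodes. The `Request` class and `generate_requests` function are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class Request:
    arrival_time: float
    src: int
    dst: int
    bandwidth_gbps: float
    duration: float


def generate_requests(num_nodes: int, count: int,
                      arrival_rate: float = 10.0,
                      mean_duration: float = 15.0,
                      bw_range: Tuple[float, float] = (25.0, 100.0),
                      seed: int = 0) -> Iterator[Request]:
    """Yield `count` requests from a Poisson arrival process with uniform traffic."""
    rng = random.Random(seed)
    now = 0.0
    for _ in range(count):
        now += rng.expovariate(arrival_rate)          # exponential inter-arrival time
        src, dst = rng.sample(range(num_nodes), 2)    # distinct, uniformly chosen nodes
        bandwidth = rng.uniform(*bw_range)            # uniform in [25, 100] Gb/s
        duration = rng.expovariate(1.0 / mean_duration)  # assumed exponential holding time
        yield Request(now, src, dst, bandwidth, duration)
```

Under this reading, the offered load is the product of the arrival rate and the mean duration, so the COST239 setting discussed later ($20$ and $30$) would correspond to a heavier load than the NSFNET setting.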

V-B Numerical Results

We first assessed the impact of the scale of the DNNs on the performance of DeepRMSA. We fixed the number of actor-learners as $16$, and implemented DNNs of three setups for both DeepRMSA-EP and DeepRMSA-FLX, i.e., $3$ hidden layers of $64$ neurons ($3\times 64$), $5$ hidden layers of $128$ neurons ($5\times 128$), and $8$ hidden layers of $256$ neurons ($8\times 256$). Figs. 3(a) and (c) show the evolutions of cumulative rewards (collected from every $1000$ requests) with different DNN setups during training. We can see that for both of the algorithms, DNNs with larger scales facilitate faster training. On average, it takes DeepRMSA $15,000$ and $5,000$ training epochs to converge with DNNs of $3\times 64$ and $5\times 128$ (or $8\times 256$), respectively. Eventually, the rewards associated with the three setups are very close, with $5\times 128$ performing slightly better. This is because $5\times 128$ enables a better ability of data representation when compared with $3\times 64$, and in the meantime does not suffer from the overfitting issue encountered by $8\times 256$. Then, we evaluated the impact of the number of actor-learners by fixing the sizes of the DNNs as $5\times 128$ and implementing DeepRMSA with $1$, $8$ and $16$ actor-learners. Figs. 3(b) and (d) show the corresponding evolutions of cumulative rewards. Again, we can draw the same observations from both of the algorithms, i.e., increasing the number of actor-learners leads to faster convergence and slightly higher rewards. In particular, increasing the number of actor-learners from $1$ to $8$ can accelerate the training speed by a factor of nearly $10$, as multiple parallel actor-learners enable more diversified explorations of the problem. Since the performance gain from further increasing the number of actor-learners is marginal, we expect DeepRMSA with $16$ actor-learners to achieve the best performance. Hence, we fixed the scale of the DNNs and the number of actor-learners as $5\times 128$ and $16$, respectively, for later evaluations.

Next, we compared the performance of DeepRMSA-EP and DeepRMSA-FLX with that of the baseline algorithms, i.e., SP-FF and KSP-FF. KSP-FF has been shown to achieve the state-of-the-art performance among the existing heuristic designs [10]. Fig. 4 plots the evolution of request blocking probability from the algorithms. We can see that DeepRMSA-EP and DeepRMSA-FLX perform similarly at the beginning and outperform SP-FF after a training period of only $1,000$ epochs. However, DeepRMSA-FLX successfully beats KSP-FF after a training period of $37,500$ epochs, whereas the performance of DeepRMSA-EP eventually merely fluctuates around that of KSP-FF. After training of $150,000$ epochs, DeepRMSA-FLX can achieve a blocking reduction of $20.3\%$ compared with KSP-FF. To reveal the rationale behind the behaviors of DeepRMSA-EP and DeepRMSA-FLX, Figs. 5(a) and (b) present the results of normalized value loss and entropy of policy during training, respectively. It can be seen that the proposed window-based training mechanism facilitates more accurate value estimations (lower value losses) and stabilized training, while the training of DeepRMSA-EP starts to diverge after $10,000$ epochs. Note that training periods of thousands of epochs are too costly for practical network operations. A more efficient way of training DeepRMSA is expected to be performing offline training with an RMSA simulator first, before enrolling it in online lightpath provisioning for fine tuning [18].
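For reference, the KSP-FF baseline can be sketched as follows, assuming the $K$ candidate paths are supplied in order of increasing length and each path is described by per-link slot availability. The data layout and function names are hypothetical; the sketch enforces both the spectrum contiguous constraint (consecutive slots) and the spectrum continuous constraint (same slots on every link).

```python
from typing import Optional, Sequence, Tuple


def first_fit(link_slots: Sequence[Sequence[bool]], n: int) -> Optional[int]:
    """Lowest start index of n contiguous slots that are free on every link of a path.

    link_slots[e][f] is True when slot f is free on the e-th link of the path.
    Returns None when no such block exists.
    """
    if not link_slots or n < 1:
        return None
    run = 0
    for f in range(len(link_slots[0])):
        if all(link[f] for link in link_slots):
            run += 1
            if run == n:
                return f - n + 1
        else:
            run = 0
    return None


def ksp_ff(candidate_paths: Sequence[Sequence[Sequence[bool]]],
           slots_needed: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Try candidate paths in order; return (path index, start slot) or None if blocked."""
    for k, (links, n) in enumerate(zip(candidate_paths, slots_needed)):
        start = first_fit(links, n)
        if start is not None:
            return k, start
    return None
```

SP-FF corresponds to calling `ksp_ff` with a single candidate path. In the evaluation setting with $J=1$, DeepRMSA replaces only the fixed path ordering with a learned choice and keeps first-fit for spectrum allocation.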

To verify the robustness of DeepRMSA, we also performed simulations with the 11-node COST239 topology in Fig. 6(a). We set the average request arrival rate and service duration as $20$ and $30$ time units, respectively. All the rest of the parameters remained the same as those for the evaluations with the NSFNET topology. Fig. 6(b) shows the results of request blocking probability with the COST239 topology, which demonstrates a clear performance difference between DeepRMSA-EP and DeepRMSA-FLX. Eventually, DeepRMSA-FLX can achieve a blocking probability that is $14.3\%$ and $18.9\%$ lower than those of KSP-FF and DeepRMSA-EP, respectively.

VI Conclusion

In this paper, we proposed DeepRMSA, a DRL-based RMSA framework for learning the optimal online RMSA policies in EONs. DeepRMSA parameterizes RMSA policies with DNNs and trains the DNNs progressively with experiences from dynamic lightpath provisioning. By taking into account the unique characteristics of the RMSA problem, we developed two training mechanisms for DeepRMSA based on the framework of A3C. Simulation results show that the proposed training mechanisms facilitate successful training of DeepRMSA, which can achieve blocking reductions of more than $20.3\%$ and $14.3\%$ in the NSFNET and COST239 topologies, respectively, when compared with the baselines.

An interesting future research topic would be partitioned DeepRMSA or hierarchical-DeepRMSA where multiple DeepRMSA agents cooperate hierarchically (within the same autonomous system) or interact peer-to-peer through brokers (in a multi-domain EON scenario [22]) to achieve scalability of DeepRMSA applied to topologies with larger scales. Meanwhile, multi-agent DeepRMSA applied to multiple autonomous system networks will introduce game-theoretic approaches similar to the discussions in [23, 24], thus yielding more interesting yet practical multi-agent competitive/cooperative learning problems.

Acknowledgments

This work was supported in part by DOE DE-SC0016700, and NSF ICE-T:RC 1836921.

Bibliography


[1] O. Gerstel, M. Jinno, A. Lord, and S. J. B. Yoo, “Elastic optical networking: a new dawn for the optical layer?” IEEE Commun. Mag., vol. 50, pp. S12–S20, Apr. 2012.
[2] M. Jinno, B. Kozicki, H. Takara, A. Watanabe, Y. Sone, T. Tanaka, and A. Hirano, “Distance-adaptive spectrum resource allocation in spectrum-sliced elastic optical path network,” IEEE Commun. Mag., vol. 48, no. 8, pp. 138–145, Aug. 2010.
[3] B. Chatterjee, N. Sarma, and E. Oki, “Routing and spectrum allocation in elastic optical networks: A tutorial,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1776–1800, Third Quarter 2015.
[4] Y. Wang, X. Cao, and Y. Pan, “A study of the routing and spectrum allocation in spectrum-sliced elastic optical path networks,” in Proc. of INFOCOM, Apr. 2011, pp. 1503–1511.
[5] K. Christodoulopoulos, I. Tomkos, and E. Varvarigos, “Elastic bandwidth allocation in flexible OFDM-based optical networks,” J. Lightw. Technol., vol. 29, pp. 1354–1366, May 2011.
[6] M. Klinkowski, M. Ruiz, L. Velasco, D. Careglio, V. Lopez, and J. Comellas, “Elastic spectrum allocation for time-varying traffic in flexgrid optical networks,” J. Sel. Areas Commun., vol. 31, no. 1, pp. 26–38, Jan. 2013.
[7] L. Gong, X. Zhou, W. Lu, and Z. Zhu, “A two-population based evolutionary approach for optimizing routing, modulation and spectrum assignments (RMSA) in O-OFDM networks,” IEEE Commun. Lett., vol. 16, pp. 1520–1523, Sept. 2012.
[8] H. Wu, F. Zhou, Z. Zhu, and Y. Chen, “On the distance spectrum assignment in elastic optical networks,” IEEE/ACM Trans. Netw., vol. 25, no. 4, pp. 2391–2404, Aug. 2017.