Simulation of Nonlinear Signal Propagation in Multimode Fibers on Multi-GPU Systems

Marius Brehler, Malte Schirwon, Peter M. Krummrich, and Dominik Göddeke

arXiv:1901.01895·physics.comp-ph·February 19, 2020·Commun. Nonlinear Sci. Numer. Simul.

TL;DR

This paper presents a GPU-accelerated simulation framework for nonlinear signal propagation in multimode fibers supporting up to 120 modes, addressing computational challenges for large-scale mode-division multiplexing systems.

Contribution

It introduces a multi-GPU implementation for simulating nonlinear MDM fiber systems with many modes, evaluating GPU communication approaches and performance for large-scale simulations.

Findings

01 Efficient multi-GPU simulation of nonlinear multimode fiber propagation.

02 Performance analysis of GPU communication strategies.

03 Impact assessment of nonlinear effects on high-mode-count MDM systems.

Tables

Table 1: Configuration used for the benchmark.

$M$	$N_{s}$	$N_{sps}$	$K$	$M/K$
15	$2^{14}$	128	1	15
30	$2^{14}$	128	2	15
60	$2^{14}$	128	4	15
90	$2^{14}$	128	6	15
120	$2^{14}$	128	8	15

Table 2: Mode groups (MG), effective areas $A_{\mathrm{eff}}$ in $\mathrm{\mu m}^{2}$, and number of spatial modes.

MG	Modes in Group								$A_{\mathrm{eff}}$ in $\mathrm{\mu m}^{2}$								Total number of spatial modes up to this MG
1	LP_0,1								172								1
2	LP_1,1								231								3
3	LP_0,2	LP_2,1							347	311							6
4	LP_1,2	LP_3,1							373	372							10
5	LP_0,3	LP_2,2	LP_4,1						504	469	428						15
6	LP_1,3	LP_3,2	LP_5,1						499	545	475						21
7	LP_0,4	LP_2,3	LP_4,2	LP_6,1					653	605	615	521					28
8	LP_1,4	LP_3,3	LP_5,2	LP_7,1					618	690	674	561					36
9	LP_0,5	LP_2,4	LP_4,3	LP_6,2	LP_8,1				795	732	768	733	601				45
10	LP_1,5	LP_3,4	LP_5,3	LP_7,2	LP_9,1				731	824	835	783	635				55
11	LP_0,6	LP_2,5	LP_4,4	LP_6,3	LP_8,2	LP_10,1			932	853	907	900	834	671			66
12	LP_1,6	LP_3,5	LP_5,4	LP_7,3	LP_9,2	LP_11,1			841	951	979	957	879	702			78
13	LP_0,7	LP_2,6	LP_4,5	LP_6,4	LP_8,3	LP_10,2	LP_12,1		1066	969	1038	1050	1015	926	735		91
14	LP_1,7	LP_3,6	LP_5,5	LP_7,4	LP_9,3	LP_11,2	LP_13,1		946	1072	1124	1112	1066	966	764		105
15	LP_0,8	LP_2,7	LP_4,6	LP_6,5	LP_8,4	LP_10,3	LP_12,2	LP_14,1	1195	1081	1162	1188	1174	1119	1009	795	120

Table 3: DMGDs $\Delta\beta_{1}$ in $\mathrm{ps/km}$ and group-velocity dispersion parameters $\beta_{2}$ in $\mathrm{ps^{2}/km}$ for the different mode groups (MG).

MG	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
$\Delta\beta_{1}$	0.0	-7.32	-7.38	-7.48	-7.54	-7.60	-7.70	-7.79	-7.85	-7.95	-8.04	-8.18	-8.23	-8.39	-8.61
avg. $\beta_{2}$	-23.1	-23.3	-23.5	-23.8	-24.0	-24.2	-24.5	-24.8	-25.0	-25.3	-25.6	-25.9	-26.2	-26.7	-27.4

Equations

$$\frac{\partial \mathbf{A}_{\mathfrak{a}}}{\partial z} = \underbrace{-\frac{\alpha}{2}\mathbf{A}_{\mathfrak{a}} + i\sum_{n=0}\left(\frac{i^{n}}{n!}\beta_{n,\mathfrak{a}}\frac{\partial^{n}}{\partial t^{n}}\right)\mathbf{A}_{\mathfrak{a}}}_{\hat{L}} + \underbrace{i\gamma\left(\kappa_{\mathfrak{aa}}\left|\mathbf{A}_{\mathfrak{a}}\right|^{2} + \sum_{\mathfrak{b}\neq\mathfrak{a}}\kappa_{\mathfrak{ab}}\left|\mathbf{A}_{\mathfrak{b}}\right|^{2}\right)\mathbf{A}_{\mathfrak{a}}}_{\hat{N}} \tag{1}$$

$$A(z+h,T) = \exp\left[h\left(\hat{L}+\hat{N}\right)\right]A(z,T). \tag{2}$$

$$A(z+h,T) \approx \exp\left(h\hat{L}\right)\exp\left(h\hat{N}\right)A(z,T), \tag{3}$$

$$A(z+h,T) \approx \exp\left(\frac{h}{2}\hat{L}\right)\exp\left(\int_{z}^{z+h}\hat{N}(z^{\prime})\,dz^{\prime}\right)\exp\left(\frac{h}{2}\hat{L}\right)A(z,T) \tag{4}$$

$$A_{I} = \exp\left(-(z-z^{\prime})\hat{L}\right)A \tag{5}$$

$$\begin{aligned}
A_{I} &= \exp\left(\frac{h}{2}\hat{L}\right)A(z,T)\\
k_{1} &= \exp\left(\frac{h}{2}\hat{L}\right)\left[h\hat{N}\left(A(z,T)\right)\right]A(z,T)\\
k_{2} &= h\hat{N}\left(A_{I}+k_{1}/2\right)\left[A_{I}+k_{1}/2\right]\\
k_{3} &= h\hat{N}\left(A_{I}+k_{2}/2\right)\left[A_{I}+k_{2}/2\right]\\
k_{4} &= h\hat{N}\left(\exp\left(\frac{h}{2}\hat{L}\right)\left(A_{I}+k_{3}\right)\right)\cdot\exp\left(\frac{h}{2}\hat{L}\right)\cdot\left[A_{I}+k_{3}\right]\\
A(z+h,T) &= \exp\left(\frac{h}{2}\hat{L}\right)\left[A_{I}+k_{1}/6+k_{2}/3+k_{3}/3\right]+k_{4}/6
\end{aligned} \tag{6}$$

Full text

Simulation of Nonlinear Signal Propagation in Multimode Fibers on Multi-GPU Systems

Marius Brehler

Malte Schirwon

Peter M. Krummrich

Dominik Göddeke

TU Dortmund, Chair for High Frequency Technology, 44227 Dortmund, Germany

University of Stuttgart, Institute for Applied Analysis and Numerical Simulation, 70569 Stuttgart, Germany

Abstract

Mode-division multiplexing (MDM) is seen as a possible solution to satisfy the rising capacity demands of optical communication networks. To make MDM a success, fibers supporting the propagation of a huge number of modes are of interest. Many of the system aspects occurring during the propagation can be evaluated by using appropriate models. However, fibers are a nonlinear medium and, therefore, numerical simulations are required. For a large number of modes, the simulation of the nonlinear signal propagation leads to new challenges, for example regarding the required memory, which we address with an implementation incorporating multiple GPU-accelerators. In this paper, we evaluate two different approaches to realize the communication between the GPUs and analyze the performance for simulations involving up to 8 Tesla GPUs. As an application example, we show results for an MDM transmission system utilizing the extremely large but practically very relevant number of 120 spatial modes and analyze the impact of the nonlinear effects on the transmitted signals.

Keywords: CUDA, Fiber optics, Multi-GPU, Message passing interface, Multimode fibers, Space-division multiplexing, Split-step Fourier method, Fourth-order Runge-Kutta, Interaction picture

Journal: Computer Physics Communications

1 Introduction

One of the main challenges in the design of future optical networks is to satisfy the growing capacity demand. A very promising approach to solve this challenge is to use the yet untapped spatial dimension. Space-division multiplexing (SDM) has attracted a lot of attention in recent years, both in industry and academic research. One option to realize an SDM system is the use of multimode fibers (MMF), where each mode capable of propagation is used as a channel for individual signals, referred to as mode-division multiplexing (MDM) [1].

Recently, the utilization of 45 spatial modes in a multimode fiber as individual transmission channels was demonstrated for the first time [2]. With the availability of mode multiplexers for 45 Hermite-Gaussian modes [3, 4], and the potential to excite even more modes [5], the investigation of MDM systems supporting a large mode count is becoming increasingly relevant. During the design process of fiber optic transmission systems, numerical simulations are the common choice to study different system aspects. However, especially for a large mode count, new challenges arise within the simulation.

As fused silica is a nonlinear medium [6], the simulation of light propagating in an optical fiber is quite challenging. The nonlinear signal propagation can be described by coupled partial differential equations for which a closed-form solution only exists in very few special cases. Therefore, numerical methods are required to approximate solutions. Exploring the impact of nonlinear effects in the case of data transmission is already challenging for a single propagating mode, since long signal sequences need to be simulated. GPU-accelerators can therefore be used to speed up simulations [7, 8, 9]. The numerical effort rises sharply when optical fibers which enable the propagation of multiple modes, especially fibers with a core diameter $\geq 50\,\mathrm{\mu m}$, are the target of interest. In those fibers, several dozen or even more than 100 spatial modes can be used as spatial channels. Moreover, the restricted amount of GPU-memory limits the approach of accelerating the simulation of the nonlinear signal propagation [10]. As a result, publications that consider the nonlinear signal propagation in MMFs numerically are mostly limited to only a few modes if only a single GPU is used, e.g. 15 spatial modes in [11]. In this paper, we explore the possibility of distributing the simulation of a transmission scenario in a single fiber to multiple GPU-accelerators. Here, we realize the communication between the GPUs with the Message Passing Interface (MPI) or the NVIDIA Collective Communications Library (NCCL). Simulations with up to 36 spatial modes and 60 wavelength channels per mode [12], for which a preliminary version of our MPI-implementation was used, are only possible with multi-GPU implementations.

The paper is organized as follows: We first briefly present the mathematical description of the nonlinear signal propagation in multimode fibers. Next, we review the numerical methods, followed by our MPI and NCCL implementations, as well as GPU-specific modifications to the code required for the simulation of many modes. The description of the implementation is followed by a benchmark of the implementation incorporating up to 8 GPUs. Finally, we use the application to demonstrate the simulation of an MDM transmission system in which a fiber with $62.5\,\mathrm{\mu m}$ core diameter provides 120 spatial modes as MDM channels.

2 Modeling of the Nonlinear Signal Propagation in Multimode Fibers

The nonlinear signal propagation in multimode fibers can be described by the nonlinear Schrödinger [13] or the Manakov equation [14, 15] for multimode fibers:

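$$\frac{\partial \mathbf{A}_{\mathfrak{a}}}{\partial z} = \underbrace{-\frac{\alpha}{2}\mathbf{A}_{\mathfrak{a}} + i\sum_{n=0}\left(\frac{i^{n}}{n!}\beta_{n,\mathfrak{a}}\frac{\partial^{n}}{\partial t^{n}}\right)\mathbf{A}_{\mathfrak{a}}}_{\hat{L}} + \underbrace{i\gamma\left(\kappa_{\mathfrak{aa}}\left|\mathbf{A}_{\mathfrak{a}}\right|^{2} + \sum_{\mathfrak{b}\neq\mathfrak{a}}\kappa_{\mathfrak{ab}}\left|\mathbf{A}_{\mathfrak{b}}\right|^{2}\right)\mathbf{A}_{\mathfrak{a}}}_{\hat{N}} \tag{1}$$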

Here, $\mathbf{A}$ represents the slowly varying envelopes of the spatial modes. Within the linear part $\hat{L}$ , the coefficient $\alpha$ specifies the attenuation, and the coefficients of the Taylor series expansion of the propagation constants are given by $\beta_{n}$ . Within the nonlinear part $\hat{N}$ , the parameter $\gamma$ is associated with the nonlinear refractive index change which is due to the Kerr-effect. The intramodal nonlinear coupling coefficient is specified as $\kappa_{\mathfrak{aa}}$ , whereas the intermodal interaction is considered by $\kappa_{\mathfrak{ab}}$ . While only intramodal nonlinear effects occur during the signal propagation in a single-mode fiber, this is not the case for multimode fibers. Here, the intermodal effects must be considered additionally. Incorporating the weighted squared absolute values of ${\mathbf{A}}_{\mathfrak{b}}$ increases the numerical complexity significantly, as discussed later.

For a more detailed description of modeling the nonlinear propagation in multimode fibers, see e.g. [16].

3 Numerical Approximation

Since analytical solutions can only be calculated for a few special cases, numerical methods are required for the evaluation of Eq. (1). To approximate the solution of Eq. (1), pseudo-spectral methods like the split-step Fourier method (SSFM) [6] or the fourth-order Runge-Kutta in the Interaction Picture (RK4IP) method [17] can be used.

3.1 The Symmetric Split-Step Fourier Method

The formal solution to Eq. (1) is given by

$$A(z+h,T) = \exp\left[h\left(\hat{L}+\hat{N}\right)\right]A(z,T). \tag{2}$$

With the Baker-Hausdorff formula [18], this can be approximated as

$$A(z+h,T) \approx \exp\left(h\hat{L}\right)\exp\left(h\hat{N}\right)A(z,T), \tag{3}$$

which allows the linear part $\hat{L}$ and the nonlinear part $\hat{N}$ to be solved independently of each other. The linear part $\hat{L}$ is solved in the frequency domain and the nonlinear part $\hat{N}$ is solved in the time domain. The resulting splitting error can be further reduced by applying a symmetric split-step approach:

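$$A(z+h,T) \approx \exp\left(\frac{h}{2}\hat{L}\right)\exp\left(\int_{z}^{z+h}\hat{N}(z^{\prime})\,dz^{\prime}\right)\exp\left(\frac{h}{2}\hat{L}\right)A(z,T). \tag{4}$$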

The nonlinear part can be solved either by an iterative approach as described in [6] or with explicit schemes like the Runge-Kutta method. The symmetric splitting results in a local error of third order in the step size $h$ and a global error of $O(h^{2})$. The two variants are denoted here as SSFM-Agrawal and SSFM-RK4, the latter using a fourth-order Runge-Kutta method to solve the nonlinear step.
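The error orders quoted above can be made plausible with the Baker-Hausdorff formula. The following is a short worked sketch, not taken from the paper, under the simplifying assumption that $\hat{N}$ is treated as independent of $z$ within one step:

```latex
% Simple splitting: the first neglected term is the commutator, local error O(h^2)
\exp\left(h\hat{L}\right)\exp\left(h\hat{N}\right)
  = \exp\left(h\left(\hat{L}+\hat{N}\right)
  + \frac{h^{2}}{2}\left[\hat{L},\hat{N}\right] + O(h^{3})\right)

% Symmetric splitting: the h^2 commutator term cancels, local error O(h^3)
\exp\left(\frac{h}{2}\hat{L}\right)\exp\left(h\hat{N}\right)\exp\left(\frac{h}{2}\hat{L}\right)
  = \exp\left(h\left(\hat{L}+\hat{N}\right) + O(h^{3})\right)
```

Accumulating a local error of $O(h^{3})$ over the roughly $1/h$ steps of a fixed propagation distance gives the global error of $O(h^{2})$.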

3.2 The Fourth-Order Runge-Kutta in the Interaction Picture Method

To avoid the splitting error, Eq. (1) can be transformed with

$$A_{I} = \exp\left(-(z-z^{\prime})\hat{L}\right)A \tag{5}$$

into the ‘Interaction Picture’, where $z^{\prime}$ is the separation distance. This allows explicit schemes like the fourth-order Runge-Kutta method to be used to solve the differentiated form of Eq. (5). In contrast to the SSFM, no splitting is required, and the numerical accuracy is primarily limited by the applied explicit scheme. With the separation distance $z^{\prime}$ defined as $z+h/2$, the algorithm to advance $A(z,T)$ to $A(z+h,T)$ is

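$$\begin{aligned}
A_{I} &= \exp\left(\frac{h}{2}\hat{L}\right)A(z,T)\\
k_{1} &= \exp\left(\frac{h}{2}\hat{L}\right)\left[h\hat{N}\left(A(z,T)\right)\right]A(z,T)\\
k_{2} &= h\hat{N}\left(A_{I}+k_{1}/2\right)\left[A_{I}+k_{1}/2\right]\\
k_{3} &= h\hat{N}\left(A_{I}+k_{2}/2\right)\left[A_{I}+k_{2}/2\right]\\
k_{4} &= h\hat{N}\left(\exp\left(\frac{h}{2}\hat{L}\right)\left(A_{I}+k_{3}\right)\right)\cdot\exp\left(\frac{h}{2}\hat{L}\right)\cdot\left[A_{I}+k_{3}\right]\\
A(z+h,T) &= \exp\left(\frac{h}{2}\hat{L}\right)\left[A_{I}+k_{1}/6+k_{2}/3+k_{3}/3\right]+k_{4}/6
\end{aligned} \tag{6}$$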

as given in [17]. This method exhibits a local error of fifth-order and is globally fourth-order accurate. A more detailed comparison between the SSFM and the RK4IP method, focusing on the nonlinear signal propagation in multimode fibers, is given in [19].
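To make the update rule concrete, the following minimal sketch applies one RK4IP step, in the form commonly stated for the method of [17], to a scalar test problem. The test problem is an assumption made here for illustration only: $\hat{L}$ is reduced to the constant $-\alpha/2$ (attenuation without dispersion) and $\hat{N}(A)=i\gamma|A|^{2}$, for which a closed-form solution exists. All names and parameter values are hypothetical.

```python
import cmath
import math

def rk4ip_step(A, h, lin, nonlin):
    """One RK4IP step for a scalar envelope.

    lin:    constant value standing in for L-hat (a plain number here)
    nonlin: function A -> N-hat(A), the factor that multiplies A
    """
    half = cmath.exp(0.5 * h * lin)            # exp(h/2 * L-hat)
    A_I = half * A                             # field in the interaction picture
    k1 = half * (h * nonlin(A) * A)
    k2 = h * nonlin(A_I + k1 / 2) * (A_I + k1 / 2)
    k3 = h * nonlin(A_I + k2 / 2) * (A_I + k2 / 2)
    A_end = half * (A_I + k3)                  # back in the normal picture at z + h
    k4 = h * nonlin(A_end) * A_end
    return half * (A_I + k1 / 6 + k2 / 3 + k3 / 3) + k4 / 6

def propagate(A0, length, steps, lin, nonlin):
    """Advance A0 over `length` using `steps` equal RK4IP steps."""
    h = length / steps
    A = A0
    for _ in range(steps):
        A = rk4ip_step(A, h, lin, nonlin)
    return A

# Illustrative parameters in arbitrary units (assumptions, not from the paper).
ALPHA, GAMMA, LENGTH, A0 = 0.5, 2.0, 1.0, 1.0 + 0.0j

def lin_nonlin():
    return -ALPHA / 2, (lambda A: 1j * GAMMA * abs(A) ** 2)

def exact(z):
    """Closed-form solution of dA/dz = (-alpha/2 + i*gamma*|A|^2) A."""
    phase = GAMMA * abs(A0) ** 2 * (1 - math.exp(-ALPHA * z)) / ALPHA
    return A0 * math.exp(-ALPHA * z / 2) * cmath.exp(1j * phase)

lin, nonlin = lin_nonlin()
err_coarse = abs(propagate(A0, LENGTH, 10, lin, nonlin) - exact(LENGTH))
err_fine = abs(propagate(A0, LENGTH, 20, lin, nonlin) - exact(LENGTH))
print(err_coarse, err_fine, err_coarse / err_fine)
```

Halving the step size should reduce the error by a factor of roughly $2^{4}$, which is the behavior expected from a globally fourth-order scheme. In the actual multimode setting, the scalar factor `half` would be replaced by an FFT-based application of $\exp(\frac{h}{2}\hat{L})$ to every row of the signal matrix.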

4 Implementation of the Numerical Methods

The signals can be represented by a matrix of sampled data with the dimension $2M\times N$. Here, $M$ is the number of spatial modes, and the factor 2 results from taking both orthogonal polarization planes into account. The number of discrete time samples is given by $N$. Thus, each row represents a spatial or polarization mode. To investigate Kerr-based nonlinear effects, namely self-phase modulation (SPM) and especially cross-phase modulation (XPM), long symbol sequences need to be considered. Furthermore, if wavelength-division multiplexing (WDM) is of interest, and thus the impact of four-wave mixing (FWM) should be evaluated, each symbol needs to be represented by an appropriate number of samples to simulate a sufficiently large frequency spectrum. For example, 256 samples per symbol, denoted as $N_{sps}$, were used in [11] to simulate a spectral range of $8.192\,\mathrm{THz}$. In the referenced simulation, $M=15$ spatial modes and $N_{s}=2^{14}$ symbols per spatial and per polarization mode were considered. With $N=N_{s}\cdot N_{sps}$, this results in a complex-valued dense matrix of size $30\times 2^{22}$, which requires $1920\,\mathrm{MiB}$ of storage. When further increasing the number of spatial modes $M$, the matrix containing the sampled signal might still fit into the GPU-memory, but not all of the intermediate results do. We therefore propose to distribute the $2M$ polarization and spatial modes among $K$ processes. Since $N\gg 2M$, this approach has several advantages over splitting the $N$ contiguous samples of a single spatial or polarization mode among different processes, as discussed in the next section.
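The storage figure above follows from simple arithmetic. A minimal sketch, assuming 16 bytes per sample (one double-precision complex number); the function name is chosen here for illustration:

```python
def signal_matrix_mib(num_modes, num_symbols, samples_per_symbol, bytes_per_sample=16):
    """Size in MiB of the dense 2M x N complex signal matrix."""
    rows = 2 * num_modes                       # two polarizations per spatial mode
    cols = num_symbols * samples_per_symbol    # N = N_s * N_sps
    return rows * cols * bytes_per_sample / 2**20

# Example from the text: M = 15, N_s = 2^14, N_sps = 256, i.e. a 30 x 2^22 matrix.
print(signal_matrix_mib(15, 2**14, 256))       # → 1920.0
```

Because the size grows linearly in $M$, a single intermediate copy of the matrix for a much larger mode count quickly approaches the memory of one accelerator, which motivates distributing whole rows.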

4.1 Splitting the Numerical Problem

In [20, 21], the split-step Fourier method is parallelized by using distributed fast Fourier transform implementations. However, this requires a lot of communication between the involved compute nodes. Instead of letting multiple processes take part in the calculation of a spatial or polarization mode, we distribute only entire modes to the different processes. Here, each process is associated with one GPU, but the process itself can still involve multiple threads. Thus, the $N$ samples of a single signal are only required and processed by one unique process. As proposed, the channels are equally distributed to $K$ processes. With this, each process computes $2M/K$ channels, as illustrated in Fig. 1.

The matrix representing the sampled signal is stored in row-major order, so that the rows are laid out contiguously in memory. Therefore, the memory layout is optimized for the fast Fourier transforms (FFT), as discussed in [19]. The computation of the linear step $\hat{L}$ can be executed fully in parallel by each process independently. Information from the other processes is only required for the calculation of the nonlinear step $\hat{N}$, namely the squared absolute values of the envelopes $A$ of all modes that are not locally available.

The SSFM-Agrawal requires the computation of $|A|^{2}$ once at the position $z$ for the first iteration and at the position $z+h$ for every following iteration. Using the SSFM-RK4, the values $|A|^{2}$ are required to calculate $k_{1}$, $k_{2}$, $k_{3}$, and $k_{4}$, which is the same for the RK4IP method. The squared absolute values $|A|^{2}$ can be stored as real values. Therefore, in every iteration, or rather in every calculation of $k_{n}$, $\left(2M-2M/K\right)\cdot N$ real-valued numbers have to be provided by the other processes, and each process has to share its $\left(2M/K\right)\cdot N$ computed values. The squared absolute values $|A|^{2}$ are exchanged via MPI or NCCL. Due to the large signal matrices, one has to expect quite large messages even if communication is kept minimal with our splitting approach. For the previous example with matrix size $30\times 2^{22}$, sharing all squared absolute values would result in a message size of $960\,\mathrm{MiB}$.
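The counts above can be checked with a few lines of arithmetic. A minimal sketch, assuming 8 bytes per real value (double precision) and an equal distribution of rows; the function name and the dictionary keys are hypothetical:

```python
def exchange_per_evaluation(num_modes, num_procs, num_samples, bytes_per_value=8):
    """Values one process receives and shares per evaluation of N-hat."""
    rows_total = 2 * num_modes
    if rows_total % num_procs:
        raise ValueError("2M must be divisible by K for an equal distribution")
    rows_local = rows_total // num_procs
    to_mib = bytes_per_value / 2**20
    recv_values = (rows_total - rows_local) * num_samples   # (2M - 2M/K) * N
    share_values = rows_local * num_samples                 # (2M/K) * N
    return {
        "rows_local": rows_local,
        "recv_values": recv_values,
        "share_values": share_values,
        "recv_mib": recv_values * to_mib,
        "share_mib": share_values * to_mib,
        "all_mib": rows_total * num_samples * to_mib,
    }

# Example from the text: a 30 x 2^22 matrix, here split over K = 2 processes.
example = exchange_per_evaluation(15, 2, 2**22)
print(example["all_mib"], example["share_mib"])   # → 960.0 480.0
```

The value `all_mib` reproduces the message size quoted for sharing all squared absolute values. The amount each process has to receive grows with $K$ when the number of modes per process is kept fixed, since the total number of rows then grows as well.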

In the following, we apply our modifications to the RK4IP method. The RK4IP allows step sizes $h$ that are more than twice as large in the simulation of MDM transmission systems, as shown in [19]. Hence, less data exchange is required for the RK4IP method. Nevertheless, the presented approach can be applied in an identical fashion to the SSFM-Agrawal and the SSFM-RK4.

4.2 MPI-Implementation

One option to realize the communication between the involved GPUs is to use the Message Passing Interface [22]. Using MPI has the advantage that the GPUs do not necessarily have to be placed in the same compute node. Here, one MPI process per GPU is used. With the availability of CUDA-aware MPI implementations [23], the programmer does not have to stage the data in the host memory, as the GPU buffers can be passed directly to MPI.

A naive approach to realize the communication via MPI is the use of collective operations like MPI_Bcast or MPI_Allgather. However, these rely on blocking communication, and CUDA-aware implementations that support non-blocking collectives are still under development. Non-blocking communication has the advantage that communication can be overlapped with the processing of the data. Overlapping communication and computations is essential to hide communication costs and to obtain good scalability. We therefore decided to explicitly exchange data via asynchronous, and therefore non-blocking, send and receive operations, namely MPI_Isend and MPI_Irecv. The program sequence is described in Listing 1.

After the computation of $|A|^{2}$ for the $(2M)/K$ modes residing on the GPU, we initialize the data exchange operations. The values are sent via MPI_Isend in the send_sqrabs() function, and matching MPI_Irecv receive commands are posted in the recv_sqrabs() function. As mentioned before, these operations are non-blocking and therefore both commands return immediately, even if the transfers are not finished. Next, the CUDA kernel is launched to calculate the contribution to the nonlinear phase rotation of the modes that reside on the GPU. This launch is non-blocking as well. Afterwards, the blocking operation MPI_Waitany is called to wait until any of the MPI_Irecv commands has finished. Once one has finished, the contribution of the received $|A|^{2}$ values to the nonlinear phase rotation is calculated. When all $|A|^{2}$ values of the $K-1$ other processes have been received and all contributions have been taken into account, the nonlinear phase rotation is finally applied to the modes residing on the GPU.
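The sequence described above can be emulated on a single machine without MPI or CUDA. The following is a minimal sketch under loud assumptions: Python threads stand in for the non-blocking transfers, `as_completed` plays the role of MPI_Waitany, plain nested lists replace the GPU buffers, and the final step multiplies by $\exp(i\gamma h\,\varphi)$ purely for illustration. Only the comments naming send_sqrabs(), recv_sqrabs(), calc_nonlinear_own() and calc_nonlinear_others() refer to the text; every other name and value is hypothetical.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor, as_completed

def sqrabs(rows):
    """Real-valued |A|^2 for every sample of every row."""
    return [[abs(x) ** 2 for x in row] for row in rows]

def add_contribution(phase, kappa, local_ids, src_ids, src_sqrabs):
    """phase[a][n] += sum over b in src_ids of kappa[a][b] * |A_b[n]|^2."""
    for i, a in enumerate(local_ids):
        for j, b in enumerate(src_ids):
            for n, p in enumerate(src_sqrabs[j]):
                phase[i][n] += kappa[a][b] * p

def nonlinear_phase_on_rank(rank, blocks, ids, kappa, pool):
    """Accumulate the nonlinear phase for the modes owned by `rank`."""
    own = sqrabs(blocks[rank])
    # Stand-in for send_sqrabs()/recv_sqrabs(): one pending transfer per peer.
    pending = {pool.submit(sqrabs, blocks[r]): r
               for r in range(len(blocks)) if r != rank}
    phase = [[0.0] * len(row) for row in blocks[rank]]
    # Role of calc_nonlinear_own(): needs no remote data, so it can start at once.
    add_contribution(phase, kappa, ids[rank], ids[rank], own)
    for done in as_completed(pending):          # plays the role of MPI_Waitany
        src = pending[done]
        # Role of calc_nonlinear_others(): one call per received block.
        add_contribution(phase, kappa, ids[rank], ids[src], done.result())
    return phase

def apply_phase(rows, phase, gamma, h):
    """Rotate every sample by exp(i * gamma * h * phase) (illustrative final step)."""
    return [[x * cmath.exp(1j * gamma * h * phase[i][n]) for n, x in enumerate(row)]
            for i, row in enumerate(rows)]

# Toy set-up: M = 2 spatial modes (4 rows), K = 2 emulated processes, N = 4 samples.
M, K, N, GAMMA, H = 2, 2, 4, 0.61, 0.1
rows = [[complex(0.1 * (a + 1), 0.05 * n) for n in range(N)] for a in range(2 * M)]
kappa = [[1.0 if a == b else 0.5 / (1 + abs(a - b)) for b in range(2 * M)]
         for a in range(2 * M)]
per_rank = 2 * M // K
ids = [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(K)]
blocks = [[rows[a] for a in ids[r]] for r in range(K)]

with ThreadPoolExecutor(max_workers=K) as pool:
    distributed = []
    for r in range(K):
        phase = nonlinear_phase_on_rank(r, blocks, ids, kappa, pool)
        distributed.extend(apply_phase(blocks[r], phase, GAMMA, H))

# Reference: the same step computed on the full matrix without any splitting.
full_phase = [[0.0] * N for _ in rows]
all_ids = list(range(2 * M))
add_contribution(full_phase, kappa, all_ids, all_ids, sqrabs(rows))
reference = apply_phase(rows, full_phase, GAMMA, H)

max_dev = max(abs(x - y) for d, f in zip(distributed, reference) for x, y in zip(d, f))
print(max_dev)
```

The distributed result agrees with the unsplit reference up to floating-point rounding, because the order in which the remote contributions arrive only changes the summation order.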

This approach scales perfectly if the time needed to receive the next data is shorter than the time for the simultaneously performed computations. In this case, the GPU does not have to wait for the next data, since these are received while the GPU is performing computations. The first work package is always available on the GPU, since this is the calculation of calc_nonlinear_own() for which no data needs to be received. However, the execution of calc_nonlinear_others() relies on data sent from the other processes. In practice, the possible overlap strongly depends on the simulation set-up, i.e. the number of spatial modes $M$ and samples $N$ , and is limited by the number of involved GPUs $K$ as well as the interconnects between the GPUs.

4.3 NCCL-Implementation

A higher-level approach is to exchange data via the NVIDIA Collective Communications Library (NCCL). NCCL supports multiple GPUs installed in a single node or across multiple nodes. The library provides topology-aware collective communication primitives and features multiple ring formations for high bus utilization. Within NCCL, each collective is implemented in a single kernel and is therefore associated with a so-called CUDA stream [24]. The NCCL calls return when the operation is enqueued to the specified stream, and the collective operation is executed asynchronously. In our implementation, ncclAllGather is used to aggregate the data. As depicted in Listing 2, we use different streams for the kernel launch within calc_nonlinear_own() and the remaining kernel calls to enable concurrent execution.
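To illustrate what an all-gather over a ring looks like conceptually, the following sketch emulates a single ring in plain Python. It is not NCCL's actual implementation, whose internals are not described here; it only shows how every rank ends up with all $K$ blocks after $K-1$ neighbor-to-neighbor transfers. The function name is hypothetical.

```python
def ring_allgather(local_blocks):
    """Emulate a ring all-gather.

    local_blocks[r] is the block owned by rank r. The return value holds, for
    every rank, the list of all K blocks ordered by owner rank.
    """
    num_ranks = len(local_blocks)
    gathered = [[None] * num_ranks for _ in range(num_ranks)]
    for r in range(num_ranks):
        gathered[r][r] = local_blocks[r]
    for step in range(num_ranks - 1):
        # In each step, rank r forwards the block that originated at rank
        # (r - step) mod K to its right-hand neighbor (r + 1) mod K.
        sends = [((r + 1) % num_ranks, (r - step) % num_ranks,
                  gathered[r][(r - step) % num_ranks]) for r in range(num_ranks)]
        for dst, owner, block in sends:
            gathered[dst][owner] = block
    return gathered

result = ring_allgather(["a", "b", "c", "d"])
print(result[0])   # → ['a', 'b', 'c', 'd']
```

In every step each rank sends and receives exactly one block, so all links of the ring are busy at the same time. This is the property that makes ring-based collectives attractive when many GPUs share an interconnect.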

To enable the implementation to utilize multiple nodes, we use NCCL together with MPI. Hence, each GPU is associated with an MPI process as before. A common NCCL communicator spanning all processes is initialized as described in [25, Example 2: One Device per Process or Thread].

4.4 GPU-Acceleration

The GPU-acceleration of the RK4IP implementation is described in [19] and [26]. However, further modifications to the GPU code of our implementation are required in addition to the previously described adaptations.

In the preceding single-node implementation, only a single CUDA kernel capturing the nonlinear effects was launched. As shown before, this is now split up into an own kernel, responsible for calculating the nonlinear phase rotation of the locally stored modes, and an others kernel, responsible for the calculation for the nonlinear phase rotation induced by the modes not locally available. Thus, $K-1$ instances of the latter kernel have to be launched. The overall nonlinear phase rotation is stored in an additional array of size $(2M/K)\cdot N$ . Both kernels incorporate so-called shared memory, to alleviate the penalty occurring due to column access to the memory [19, 26]. In contrast to the single-node implementation, applying the nonlinear phase rotation to the locally available modes now only requires row access instead of column access to the memory. Applying the nonlinear phase rotation is performed by an additional kernel, as already indicated in Listings 1 and 2.

In addition, the interaction matrix no longer fits into the constant memory for a large number of modes without using a splitting approach. Storing all $\kappa$ values requires a matrix of $2M\times 2M$ elements. Assuming a symmetric matrix, which is the case for linearly polarized (LP) modes [13], it is sufficient to store only the upper triangular matrix, reducing the number of elements to $(2M\cdot 2M)/2+M$. This is exemplified for the case of $M=2$ and $K=2$ in Fig. 2.

For a huge number of modes, e.g. $M=120$, 28920 double-precision values of $8\,\mathrm{B}$ each would still need to be stored in the constant memory, of which only $64\,\mathrm{KiB}$ are available. Therefore, this approach does not lead to a sufficient saving. However, only $2M/K\cdot 2M\cdot 2-(2M/K)^{2}$ elements need to be accessed for the calculations. For GPU 2, these are the red and green shaded elements in Fig. 2. The other elements of the matrix are only required on the other involved GPU. Taking the symmetry into account again, it is sufficient to store only the rows or columns which apply to the modes considered on the respective GPU. With this in mind, the number of elements can be reduced to $2M/K\cdot 2M$. Furthermore, for the $\kappa$ coefficients describing the nonlinear coupling between the modes residing on the GPU, it would again be sufficient to store only the upper triangular matrix, as visualized in Fig. 2. However, distributing the matrix via MPI and the necessary index arithmetic are more complicated in this case, and only $(2M/K\cdot 2M/K)/2-M/K$ additional elements can be saved.
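The element counts in this discussion can be evaluated directly. A minimal sketch that plugs in the formulas from the text, assuming $K=8$ for the 120-mode case as in the benchmark configuration; the function name and dictionary keys are hypothetical:

```python
CONSTANT_MEMORY_BYTES = 64 * 1024   # 64 KiB of constant memory, as stated in the text
BYTES_PER_VALUE = 8                 # double precision

def kappa_counts(num_modes, num_procs):
    """Number of kappa elements for the storage variants discussed in the text."""
    rows_total = 2 * num_modes
    rows_local = rows_total // num_procs
    return {
        "full": rows_total * rows_total,                               # 2M x 2M
        "upper_triangular": rows_total * rows_total // 2 + num_modes,  # (2M*2M)/2 + M
        "accessed_per_gpu": 2 * rows_local * rows_total - rows_local ** 2,
        "stored_per_gpu": rows_local * rows_total,                     # 2M/K * 2M
        "extra_saving": rows_local * rows_local // 2 - num_modes // num_procs,
    }

counts = kappa_counts(120, 8)
print(counts["upper_triangular"])                                         # → 28920
print(counts["upper_triangular"] * BYTES_PER_VALUE > CONSTANT_MEMORY_BYTES)   # → True
print(counts["stored_per_gpu"] * BYTES_PER_VALUE <= CONSTANT_MEMORY_BYTES)    # → True
```

Under these assumptions the triangular matrix needs 231360 bytes, far more than the available constant memory, whereas the 7200 elements stored per GPU occupy 57600 bytes and fit. The additional saving from a local triangular layout would be only 435 elements.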

5 Benchmark

To achieve the maximum performance, peer-to-peer access between the GPUs is essential. The benchmark is therefore performed on an AWS EC2-instance of type p2.8xlarge. This instance incorporates 4 Tesla K80 accelerators. Each K80 provides a pair of GK210 GPUs, resulting in 8 available GPUs. On this instance type the GPUs are connected via a common PCIe fabric.

The configuration used for the benchmark is given in Table 1.

Considering a sequence with a symbol rate of $32\,\mathrm{GBaud}$, a spectral range of $4.096\,\mathrm{THz}$ is simulated. Incorporating 8 GPUs allows the nonlinear interaction between 120 spatial modes to be evaluated. This is of interest as it is the number of potentially usable spatial modes in a fiber with $62.5\,\mathrm{\mu m}$ core diameter [27].

The number of involved processes, or rather GPUs, is scaled up from 1 to 8 to investigate the scaling of the proposed implementations. The number of spatial modes residing on each GPU, $M/K$, is kept constant. In consequence, the total signal matrix occupies up to $7680\,\mathrm{MiB}$, of which $960\,\mathrm{MiB}$ are stored per GPU. For every calculation of $\hat{N}$, each GPU needs to share $480\,\mathrm{MiB}$. For the benchmark, 150 steps have been simulated and all calculations are executed with double precision. Recall that $\hat{N}$ is calculated 4 times per step. This is the same for an SSFM-RK4 implementation, whereas the number of evaluations of $\hat{N}$ depends on the number of iterations in an SSFM-Agrawal implementation. The initial distribution and the final collection of the sampled signal matrix, as well as the transfer of further necessary parameters and data, are excluded from the benchmark. Results are shown in Fig. 3.

Here, the execution times $T_{K}$ are normalized to the execution time of our previous single-node, single-GPU implementation [19, 26].

With only a single GPU used, $K=1$, the relative runtime is $>1$. Due to splitting the calculation of $\hat{N}$ into several kernels, the runtime increases by approximately $8.5\,\%$. For $K\leq 4$, the MPI- and NCCL-implementations scale nearly identically. With even more GPUs involved, the execution time of the MPI-implementation rises sharply. Incorporating all 8 GPUs, the MPI-implementation requires 6.76 times the execution time of the single-GPU implementation, whereas the NCCL-implementation scales with a factor of 4.26. To evaluate the reason, the benchmarks are rerun without the additional calculations performed in calc_nonlinear_others(). Therefore, only the amount of communication grows with an increasing $K$. Here, the MPI-implementation shows nearly identical results, whereas the relative runtime of the NCCL-implementation drops. For the MPI-implementation, this clarifies that the increase of execution time is caused by communication, and not by the additional calculations.

In conclusion, communication and calculations can be perfectly overlapped using MPI, as additionally confirmed by profiling the application. However, the implementation shows a communication pattern for $K>4$ that leaves room for improvement. With the NCCL-implementation, on the contrary, communication and calculations do not fully overlap. Nevertheless, the topology-aware communication patterns show clear benefits for the simulation with more than 4 GPUs involved. For $K=2$, highlighted in Fig. 3, the MPI-implementation slightly outperforms the NCCL-implementation (factor 1.38 vs. 1.51). With only two GPUs taking part in the simulation, NCCL’s topology-awareness cannot improve communication. In this case, overlapping communication and calculations is much more important.

From the viewpoint of weak scaling, an improved performance is desirable, especially for a large number of involved GPUs. However, given the required all-to-all communication, the performance metrics are not surprising. Nevertheless, the application enables the simulation of the nonlinear signal propagation for a huge number of spatial modes and a large frequency range, which was not possible so far. Improved performance of the MPI implementation can be expected when decoupling the CPU-GPU control flow. With the future availability of MPI-GDS [28], the asynchronous send operations can be triggered directly after the squared absolute values are computed, leading to better hiding of the communication. In addition, the optimization of collective operations is under investigation [29, 30]. Therefore, future library implementations offer the potential to further improve the performance of the proposed implementation. In the next section, we show what is already possible with the implementation based on the currently available libraries.

6 Simulation of a $62.5\,\mathrm{\mu m}$ Fiber

Within the application example, we demonstrate the feasibility of an MDM transmission over a multimode fiber with graded-index profile featuring a core diameter of $62.5\,\mathrm{\mu m}$ and a numerical aperture of 0.275. The highest order modes feature small effective refractive indices and are therefore affected most by the cladding. In such a fiber, 120 of the spatial modes capable of propagation can be used for a mode-multiplexed transmission [27]. We assume a profile exponent of 1.94 and a trench, i.e. a section with a reduced refractive index within the cladding of the fiber, at a distance of $1.25\,\mathrm{\mu m}$ from the core and with a width of $3.5\,\mathrm{\mu m}$. The refractive index difference between the cladding and the trench is $6.5\cdot 10^{-3}$. Further, an attenuation coefficient of $0.23\,\mathrm{dB}$ is assumed for all modes.

The modes form strongly coupled groups of modes, meaning the linear coupling between the modes belonging to the same mode group is strong, whereas the linear coupling between modes of different mode groups is weak. This is taken into account by choosing the appropriate nonlinear coupling coefficients $\kappa$ for the Manakov equation [15]. The mode groups (MG) are considered as given in Table 2, which further provides the effective areas $A_{\mathrm{eff}}$ of the spatial modes. The mode profiles and the later used propagation constants are calculated numerically with JCMsuite [31]. With the definition of $\gamma=(n_{2}\omega_{0})/(c_{0}A_{\mathrm{eff,LP_{0,1}}})$ and a nonlinear refractive index $n_{2}=2.6\cdot 10^{-20}\,\mathrm{m}^{2}\,\mathrm{W}^{-1}$, the fiber features a nonlinear parameter $\gamma=0.61\,\mathrm{W}^{-1}\,\mathrm{km}^{-1}$. In the definition of $\gamma$, the center frequency of the optical signal is given by $\omega_{0}$ and $c_{0}$ is the speed of light in vacuum. The differential mode group delays $\Delta\beta_{1}=\beta_{1,\mathfrak{a}}-\beta_{1,\mathrm{LP}_{0,1}}$ are calculated with Eq. (29) from [32]. With $\Delta\beta_{1}$ fulfilling the so-called phase-matching condition, the cross-phase modulation between the strongly coupled groups is maximized. This is potentially the worst case for the nonlinear effects. The phase-matching condition for multimode fibers incorporates the group-velocity dispersion parameter $\beta_{2}$, the second derivatives of the propagation constant $\beta$, which are calculated numerically. The values assumed for the simulation for $\Delta\beta_{1}$ and $\beta_{2}$ are given in Table 3.
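The quoted nonlinear parameter can be reproduced from the definition of $\gamma$. A minimal sketch, assuming that the center frequency is the midpoint of the WDM grid used later, 193.425 THz (the text does not state $\omega_{0}$ explicitly), and using the effective area of $\mathrm{LP}_{0,1}$ from Table 2:

```python
import math

C0 = 299_792_458.0   # speed of light in vacuum in m/s

def nonlinear_parameter(n2, f0_hz, a_eff_m2):
    """gamma = n2 * omega0 / (c0 * A_eff), returned in 1/(W m)."""
    omega0 = 2.0 * math.pi * f0_hz
    return n2 * omega0 / (C0 * a_eff_m2)

N2 = 2.6e-20            # nonlinear refractive index in m^2/W, from the text
A_EFF_LP01 = 172e-12    # 172 um^2 from Table 2, converted to m^2
F0 = 193.425e12         # ASSUMPTION: midpoint of the 191.95 to 194.9 THz grid

gamma_per_w_km = 1e3 * nonlinear_parameter(N2, F0, A_EFF_LP01)
print(round(gamma_per_w_km, 2))   # → 0.61
```

The result is insensitive to the exact choice of the center frequency within the transmission band, since $\gamma$ scales linearly with $\omega_{0}$ and the band spans only about 1.5 % of the carrier frequency.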

In the simulation, each spatial mode carries 60 WDM channels on a $50\,\mathrm{GHz}$ grid. The wavelength channel with the lowest carrier frequency is centered at $191.95\,\mathrm{THz}$, and the one with the highest carrier frequency at $194.9\,\mathrm{THz}$. Thus, a spectral bandwidth of $3\,\mathrm{THz}$ is used for transmission. Each WDM channel carries a dual-polarization (DP) quadrature phase-shift keying (QPSK) modulated signal with a symbol rate of $32\,\mathrm{GBaud}$. The average launch power per DP-QPSK signal is set to $-1\,\mathrm{dBm}$. We simulate the transmission over two $80\,\mathrm{km}$ spans, resulting in a transmission distance of $160\,\mathrm{km}$. After each span, the fiber losses are compensated by a noiseless amplifier with a flat gain profile. White noise is added to the signals before the receiver, setting the optical signal-to-noise ratio (OSNR) to $20\,\mathrm{dB}$. In this regime, the nonlinear effects are the dominant source of signal degradation. Within the digital signal processing stage, the dispersion is perfectly compensated. Finally, clock recovery [33] and phase recovery [34] are applied. To quantify the nonlinear impairments, the squared Q-factors ($Q^{2}$-factors) are estimated for each mode and each WDM channel.
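The channel plan and the resulting aggregate launch powers follow directly from the numbers above and can be reproduced with a few lines. This is a minimal sketch with illustrative function names; frequencies are kept as integer GHz values to avoid floating-point rounding on the grid, and the aggregate powers assume that all signals are launched with exactly the stated average power.

```python
import math


def wdm_grid_ghz(lowest_ghz, spacing_ghz, n_channels):
    """Carrier frequencies of an equally spaced WDM grid, in GHz."""
    return [lowest_ghz + k * spacing_ghz for k in range(n_channels)]


def aggregate_power_dbm(per_signal_dbm, n_signals):
    """Total power of n equally powered signals, in dBm."""
    return per_signal_dbm + 10.0 * math.log10(n_signals)


channels = wdm_grid_ghz(191_950, 50, 60)
occupied_bandwidth_ghz = len(channels) * 50        # 60 slots of 50 GHz each
carrier_span_ghz = channels[-1] - channels[0]      # distance between outer carriers

power_per_mode_dbm = aggregate_power_dbm(-1.0, 60)        # 60 channels in one mode
power_total_dbm = aggregate_power_dbm(-1.0, 60 * 120)     # all 120 spatial modes

print(f"highest carrier: {channels[-1] / 1000:.2f} THz")
print(f"occupied bandwidth: {occupied_bandwidth_ghz / 1000:.2f} THz")
print(f"power per mode: {power_per_mode_dbm:.2f} dBm, total: {power_total_dbm:.2f} dBm")
```

The outer carriers are 2.95 THz apart; the quoted 3 THz corresponds to 60 slots of 50 GHz each.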

The minimal, mean, and maximal $Q^{2}$-factors for each mode after the transmission over $160\,\mathrm{km}$ are depicted in Fig. 4.

Since the fundamental mode features the smallest effective area $A_{\mathrm{eff}}$ and the highest coupling coefficients, it suffers most from the nonlinear impairments and features the smallest $Q^{2}$-factor. For the higher-order modes, the mean $Q^{2}$-factors improve. To assess the nonlinear signal distortion, we evaluate the mean $Q^{2}$-factors relative to the $Q^{2}$-factors obtained for a back-to-back (b2b) transmission, shown in Fig. 5.

With about $-0.06\,\mathrm{dB}$ after the first and $-0.13\,\mathrm{dB}$ after the second $80\,\mathrm{km}$ span, the fundamental mode again shows the highest signal degradation. Independent of the transmission distance, the higher-order modes are less affected by the nonlinear effects. However, one can clearly identify the lower-order mode groups, especially in the results for a transmission over $160\,\mathrm{km}$. Although the induced nonlinear penalty increases with increasing transmission distance, the overall penalty is rather small. For lower OSNRs, an even smaller penalty can be expected [12]. Hence, the utilization of 120 spatial modes in an MDM transmission system seems possible, and the Kerr-based nonlinear impairments do not prohibit the use of such a fiber.
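The per-mode statistics and the penalty relative to the back-to-back case can be sketched as follows. The estimation of the $Q^{2}$-factors themselves is not reproduced here; the input values are hypothetical placeholders, the function names are illustrative, and averaging in the dB domain is an assumption, since the text does not state whether the mean is taken in linear or logarithmic units.

```python
import math


def to_db(value_linear):
    """Convert a linear power-like quantity (such as Q^2) to dB."""
    return 10.0 * math.log10(value_linear)


def q2_summary_db(q2_per_channel):
    """Return (min, mean, max) of the Q^2-factors of one mode, in dB.

    Assumption: the mean is taken over the dB values of the WDM channels.
    """
    values_db = [to_db(q2) for q2 in q2_per_channel]
    return min(values_db), sum(values_db) / len(values_db), max(values_db)


def q2_penalty_db(q2_linear, q2_b2b_linear):
    """Q^2 relative to the back-to-back value, in dB (negative = degradation)."""
    return to_db(q2_linear) - to_db(q2_b2b_linear)


# Hypothetical linear Q^2 values for three WDM channels of one mode.
q2_after_fiber = [95.0, 97.0, 99.0]
q2_back_to_back = 100.0

q2_min_db, q2_mean_db, q2_max_db = q2_summary_db(q2_after_fiber)
penalties_db = [q2_penalty_db(q2, q2_back_to_back) for q2 in q2_after_fiber]

print(f"min/mean/max Q^2: {q2_min_db:.2f} / {q2_mean_db:.2f} / {q2_max_db:.2f} dB")
print("penalties in dB:", [round(p, 2) for p in penalties_db])
```

On this scale, a penalty of $-0.13\,\mathrm{dB}$ corresponds to a reduction of the linear $Q^{2}$-factor by about 3 percent.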

7 Conclusion

In this paper, we presented a multi-GPU implementation to simulate the nonlinear signal propagation in multimode fibers. This allows the simulation of a huge number of spatial modes while considering a large spectral bandwidth at the same time. We described the modifications necessary to simulate many spatial modes and discussed various approaches to realizing the communication between the GPUs. The performance of the implementation was analyzed with the communication between the GPUs realized with either MPI or NCCL. While MPI shows performance benefits when only a few GPUs are used, the implementation clearly profits from NCCL’s topology-awareness if more than 4 GPUs are involved in the simulation. For the first time, it was possible to simulate a mode-division multiplexing system utilizing 120 spatial modes in a $62.5\,\mathrm{\mu m}$ fiber along with 60 wavelength channels per spatial mode by using 8 GPUs. In the application example, we evaluated the nonlinear impairments for each spatial mode and each wavelength channel. The results allow us to conclude that the nonlinear impairments do not prohibit the use of such a large number of spatial modes in a mode-division multiplexing system. Regarding the nonlinear effects, it can be expected that mode-division multiplexing systems can be scaled up far beyond the most recent transmission experiment, in which 45 spatial modes were transmitted over $26.5\,\mathrm{km}$ [2]. The implementation presented here allows such future systems to be studied.

Acknowledgments

The authors are grateful for the donation of a Tesla K40c by NVIDIA through the GPU Grant program. The work of M. Schirwon and D. Göddeke was supported by the German Excellence Initiative through EXC 310 (SimTech).

Bibliography

[1] D. J. Richardson, J. M. Fini, L. E. Nelson, Space-division multiplexing in optical fibres, Nat. Photon. 7 (5) (2013) 354–362. doi:10.1038/nphoton.2013.94.
[2] R. Ryf, et al., High-Spectral-Efficiency Mode-Multiplexed Transmission over Graded-Index Multimode Fiber, in: 44th European Conference and Exhibition on Optical Communication (ECOC), Rome, Italy, 2018, paper Th3B.1. doi:10.1109/ECOC.2018.8535536.
[3] S. Bade, B. Denolle, G. Trunet, N. Riguet, P. Jian, O. Pinel, G. Labroille, Fabrication and Characterization of a Mode-selective 45-Mode Spatial Multiplexer based on Multi-Plane Light Conversion, in: Optical Fiber Communication Conference (OFC) Postdeadline Papers, San Diego, CA, USA, 2018, paper Th4B.3. doi:10.1364/OFC.2018.Th4B.3.
[4] N. K. Fontaine, R. Ryf, H. Chen, S. Wittek, J. Li, J. C. Alvarado, J. E. A. Lopez, Packaged 45-Mode Multiplexers for a 50-µm Graded Index Fiber, in: 44th European Conference and Exhibition on Optical Communication (ECOC), Rome, Italy, 2018, paper Mo4E.1. doi:10.1109/ECOC.2018.8535302.
[5] N. K. Fontaine, R. Ryf, H. Chen, D. T. Neilson, K. Kim, J. A. Carpenter, Scalable mode sorter supporting 210 Hermite-Gaussian modes, in: Optical Fiber Communication Conference Postdeadline Papers, San Diego, CA, USA, 2018, paper Th4B.4. doi:10.1364/OFC.2018.Th4B.4.
[6] G. P. Agrawal, Nonlinear Fiber Optics, 5th Edition, Academic Press, 2012.
[7] S. Hellerbrand, N. Hanik, Fast Implementation of the Split-Step Fourier Method Using a Graphics Processing Unit, in: Optical Fiber Communication Conference (OFC), San Diego, CA, USA, 2010, paper OTuD7. doi:10.1364/OFC.2010.OTuD7.
[8] S. Pachnicke, A. Chachaj, M. Helf, P. M. Krummrich, Fast parallel simulation of fiber optical communication systems accelerated by a graphics processing unit, in: International Conference on Transparent Optical Networks (ICTON), Munich, Germany, 2010, paper Th.B1.5. doi:10.1109/ICTON.2010.5549002.

6 Simulation of a $62.5\,\mathrm{\mu m}$ Fiber