AdaNN: Adaptive Neural Network-based Equalizer via Online Semi-supervised Learning

Qingyi Zhou, Fan Zhang, and Chuanchuan Yang

arXiv:1907.10258·eess.SP·August 26, 2020

TL;DR

AdaNN introduces an adaptive online neural network equalizer that rapidly fine-tunes itself without labeled data, significantly improving signal recovery in optical communications with changing link properties.

Contribution

This paper presents AdaNN, a novel online training scheme for neural network equalizers that accelerates convergence and enhances generalization without requiring labeled training sequences.

Findings

01

Accelerated convergence speed by 4.5 times using data augmentation and virtual adversarial training.

02

BER stabilized below 1e-3 after training with 10^5 unlabeled symbols.

03

Outperforms non-adaptive neural networks and traditional MLSE in optical link scenarios.

Tables

Table 1. TABLE I: Convergence time and final BER performance (set2) of AdaNN with different loss functions, including self-training, $\Pi$-model, VAT, and Aug-VAT. The $95\%$ confidence intervals of the BER estimates are also provided.

Loss function	Convergence time (batches)	Final BER ($\times 10^{-4}$)	$95\%$ confidence interval ($\times 10^{-4}$)
Self-training	$72$	$8.39$	$[6.32, 10.84]$
$\Pi$-model ($\sigma$=0.1)	$24$	$4.27$	$[2.84, 6.08]$
$\Pi$-model ($\sigma$=0.2)	$\mathbf{16}$	$2.75$	$[1.63, 4.25]$
$\Pi$-model ($\sigma$=0.3)	$\mathbf{16}$	$2.44$	$[1.67, 3.45]$
$\Pi$-model ($\sigma$=0.4)	$32$	$3.43$	$[2.16, 5.08]$
VAT ($\epsilon$=0.1)	$40$	$5.34$	$[3.72, 7.34]$
VAT ($\epsilon$=0.2)	$24$	$2.98$	$[1.80, 4.53]$
VAT ($\epsilon$=0.3)	$\mathbf{16}$	$2.44$	$[1.67, 3.45]$
VAT ($\epsilon$=0.4)	$\mathbf{16}$	$2.52$	$[1.45, 3.96]$
VAT ($\epsilon$=0.5)	$24$	$2.82$	$[1.69, 4.34]$
Aug-VAT ($\sigma$=0.10, $\epsilon$=0.3)	$\mathbf{16}$	$2.59$	$[1.51, 4.06]$
Aug-VAT ($\sigma$=0.15, $\epsilon$=0.3)	$\mathbf{16}$	$2.29$	$[1.28, 3.68]$
Aug-VAT ($\sigma$=0.20, $\epsilon$=0.3)	$\mathbf{16}$	$2.75$	$[1.63, 4.25]$
Aug-VAT ($\sigma$=0.30, $\epsilon$=0.3)	$\mathbf{16}$	$3.36$	$[2.10, 4.99]$

Table 2. TABLE II: Ratio $r_{k}$ for weight matrices between different layers.

$k$	$S(\boldsymbol{W}_{k}^{\textrm{init}})$	$S(\Delta\boldsymbol{W}_{k})$	$r_{k}$
$1$	$71.122$	$5.788$	$0.081$
$2$	$30.641$	$0.929$	$0.030$
$3$	$31.368$	$0.250$	$0.008$
$4$	$34.080$	$0.189$	$0.006$
$5$	$21.338$	$0.172$	$0.008$
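
As a quick consistency check, the ratios in Table II can be recomputed from the two $S(\cdot)$ columns, where $S(\boldsymbol{W})$ is the sum of the absolute values of all entries of $\boldsymbol{W}$. The Python sketch below does this; the helper names `abs_sum` and `change_ratio` and the toy matrices are illustrative assumptions, not the authors' code.

```python
def abs_sum(W):
    """S(W): sum of the absolute values of all matrix entries."""
    return sum(abs(x) for row in W for x in row)

def change_ratio(W_init, W_final):
    """r_k = S(W_final - W_init) / S(W_init)."""
    delta = [[f - i for i, f in zip(row_i, row_f)]
             for row_i, row_f in zip(W_init, W_final)]
    return abs_sum(delta) / abs_sum(W_init)

# Toy matrices (placeholders, not the trained weights).
W_init = [[1.0, -2.0], [0.5, 0.5]]
W_final = [[1.1, -2.0], [0.5, 0.3]]
r_toy = change_ratio(W_init, W_final)

# Tabulated (S(W_init), S(Delta W), r_k) for k = 1..5, copied from Table II.
table = [(71.122, 5.788, 0.081), (30.641, 0.929, 0.030), (31.368, 0.250, 0.008),
         (34.080, 0.189, 0.006), (21.338, 0.172, 0.008)]
ratios = [round(s_delta / s_init, 3) for s_init, s_delta, _ in table]
```

Rounded to three decimals, the recomputed ratios agree with the tabulated $r_{k}$ column.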

Equations

$$\boldsymbol{a}_{k}^{(i)}=\boldsymbol{W}_{k}\boldsymbol{h}_{k-1}^{(i)}+\boldsymbol{b}_{k},$$

$$\boldsymbol{h}_{k}^{(i)}=\sigma(\boldsymbol{a}_{k}^{(i)}),$$

$$\min_{\{\boldsymbol{W}_{k},\boldsymbol{b}_{k}\}}L_{loss}=\min_{\{\boldsymbol{W}_{k},\boldsymbol{b}_{k}\}}\left(-\frac{1}{N_{seq}}\sum_{i=1}^{N_{seq}}\sum_{j=1}^{M}y_{j}^{(i)}\ln(o_{j}^{(i)})\right),$$

$$\boldsymbol{v}^{(i)}=[\hat{\boldsymbol{r}}_{i-L},...,\hat{\boldsymbol{r}}_{i},...,\hat{\boldsymbol{r}}_{i+L}].$$

$$g_{\sigma}(\boldsymbol{v})=\boldsymbol{v}+\boldsymbol{\eta},\quad\boldsymbol{\eta}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{\Gamma(2L+1)}).$$

$$\boldsymbol{r}_{\textrm{adv}}^{(i)}=\arg\max_{\boldsymbol{r};\,\|\boldsymbol{r}\|\leqslant\epsilon}\left\{-\sum_{j=1}^{M}y_{j}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}+\boldsymbol{r}))\right)_{j}\right]\right\}.$$

$$L_{loss}=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\sum_{j=1}^{M}\hat{y}_{j}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}))\right)_{j}\right]=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\hat{y}_{l_{\textrm{adv}}}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}))\right)_{l_{\textrm{adv}}}\right],$$

$$l_{\textrm{adv}}=\arg\max_{k}\left[\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}+\boldsymbol{r}_{\textrm{vadv}}^{(i)}))\right]_{k}.$$

$$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\frac{\alpha_{t}}{\psi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t})}\phi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t}),$$

$$L_{loss}=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\sum_{j=1}^{M}\hat{y}_{j}^{(i)}\ln(o_{j}^{(i)}).$$

$$\hat{y}_{j}^{(i)}=\begin{cases}1, & \textrm{if } j=\arg\max_{k}(o_{k}^{(i)}),\\ 0, & \textrm{otherwise}.\end{cases}$$

$$k_{\textrm{NN}}=\Gamma(2L+1)\cdot R+(l_{\textrm{NN}}-3)\cdot R^{2}+R\cdot M.$$

$$k_{\textrm{back}}=2k_{\textrm{NN}}+(l_{\textrm{NN}}-2)\cdot R+M\approx 2k_{\textrm{NN}}.$$

$$k_{\textrm{back}}=[2\Gamma(2L+1)\cdot R+R]+(l_{\textrm{NN}}-3)\cdot(2R^{2}+R)+(2R\cdot M+M)=2k_{\textrm{NN}}+(l_{\textrm{NN}}-2)\cdot R+M.$$

$$S(\boldsymbol{W})=\sum_{i,j}|W_{ij}|.$$

$$r_{k}=\frac{S(\Delta\boldsymbol{W}_{k})}{S(\boldsymbol{W}_{k}^{\textrm{init}})}=\frac{S(\boldsymbol{W}_{k}^{\textrm{final}}-\boldsymbol{W}_{k}^{\textrm{init}})}{S(\boldsymbol{W}_{k}^{\textrm{init}})}.$$

Full text

AdaNN: Adaptive Neural Network-based Equalizer via Online Semi-supervised Learning

Qingyi Zhou, Fan Zhang, and Chuanchuan Yang

Manuscript received xx xx, xxxx; revised xx xx, xxxx. (Corresponding author: Chuanchuan Yang.) The authors are with the State Key Laboratory of Advanced Optical Communication Systems and Networks, Department of Electronics, Peking University, Beijing 100871, China (e-mail: [email protected], [email protected], [email protected]). This work is funded by the National Key R&D Program of China under Grant 2018YFB1801702 and the Joint Fund of the Ministry of Education under Grant 6141A02033347.

Abstract

The demand for high speed data transmission has increased rapidly, leading to advanced optical communication techniques. In the past few years, multiple equalizers based on neural networks (NN) have been proposed to recover signals from nonlinear distortions. However, previous experiments mainly focused on achieving a low bit error rate (BER) on a certain dataset with an offline-trained NN, neglecting the generalization ability of the NN-based equalizer when the properties of the optical link change. The development of an efficient online training scheme is urgently needed. In this paper, we propose an adaptive online training scheme, which can fine-tune the parameters of an NN-based equalizer without the help of an online training sequence. By introducing data augmentation and virtual adversarial training, the convergence speed is accelerated by 4.5 times compared with decision-directed self-training. The proposed adaptive NN-based equalizer is called “AdaNN”. Its BER has been evaluated under two scenarios: a 56 Gb/s PAM4-modulated VCSEL-MMF optical link (100-m), and a 32 Gbaud 16QAM-modulated Nyquist-WDM system (960-km SSMF). In our experiments, with the help of AdaNN, BER values can be quickly stabilized below $10^{-3}$ after training with $10^{5}$ unlabeled symbols. AdaNN shows great performance improvement compared with non-adaptive NN and conventional MLSE.

Index Terms:

Optical fiber communication, Adaptive nonlinear equalizer, Neural network, Semi-supervised learning.

I Introduction

With the continuous development of the Internet, higher bandwidth data transmission is required. Advanced modulation techniques together with novel algorithms have emerged to fulfill the requirements. Digital signal processing (DSP) is quite essential for improving the bit-error-rate (BER) performance and raising the optical link’s transmission rate.

In order to achieve large transmission capacity in short-range optical interconnects, researchers have tried out a variety of conventional DSP techniques. With feed forward equalization (FFE), the data rate of non-return-to-zero (NRZ) has reached 71 Gb/s [1]. Other conventional equalization techniques, such as decision feedback equalizer (DFE) and maximum likelihood sequence estimator (MLSE), have also been utilized [2]-[5]. By utilizing pre-emphasis, 94 Gb/s and 107 Gb/s PAM-4 transmission have been demonstrated by K. Szczerba et al. [6] and J. Lavrencik et al. [7] respectively.

Researchers have also been trying to exploit the potential of DSP algorithms for long-reach optical communication systems. Volterra nonlinear equalizer (VNLE) has been utilized to mitigate nonlinear distortions [8]. A few works have been done to lower the complexity of VNLE [9, 10]. Other nonlinearity compensation techniques have also been investigated, such as digital back-propagation (DBP) [11], perturbation-based compensation [12], and nonlinear Kalman filter [13].

All the above-mentioned DSP algorithms are designed on rich expert knowledge, and some can be proved optimal for tractable mathematical models. However, many nonlinearities (modulation nonlinearity together with square law detection) that exist in practical systems can only be approximately captured and are difficult to compensate with conventional DSP techniques [14]. In order to solve this problem, many DSP algorithms based on neural network have been proposed, including artificial neural network (ANN) based equalizer [15, 16], convolutional neural network (CNN) based equalizer [17] and recurrent neural network (RNN) based equalizer [18, 19]. Implemented in different optical communication systems, these NN-based equalizers have not only reached lower BER, but also shown excellent capability of mitigating nonlinearity.

Although researchers report having achieved lower BER using NN, one problem remains: it is difficult for an NN to generalize over varying channel conditions. In an actual communication system, the external environment and channel parameters may change, causing the distribution of received data to “drift away”. For example, in data centers fibers are in motion due to rack vibration, which causes the channel properties to vary over time. An NN-based equalizer that performs well on the training set/test set may then suffer from severe performance degradation [20]. On the other hand, it’s too costly to train different NNs for different communication systems. Lacking the ability to adjust parameters adaptively, existing NN-based equalizers cannot adapt to channel variations and are thus not practical enough.

Developing an adaptive NN-based equalizer is therefore important. A new training scheme is expected, one that does not rely on a massive amount of collected labeled data. Some previous works on adaptive equalizers based on machine learning also require a training sequence [21, 22]. Unfortunately, a similar parameter adjustment method cannot be used directly for NN-based equalizers. We’ve found that, when only a short training sequence is provided to an NN-based equalizer, the equalizer still suffers from degraded BER performance. Researchers in the field of wireless communication are also exploring possible applications of deep learning techniques [23]-[27]. Most of these works still rely on pilots when model parameters are changed adaptively [23]. S. Schibisch et al. have used error-correcting codes (ECC) to construct a labeled dataset for online training, but this causes overhead and relies on a special protocol [28]. In [29], the authors claimed that channel estimation based on semi-supervised learning is still an open subject.

In this paper, we propose an adaptive online training scheme, which can be used to fine-tune an NN-based equalizer without the help of a training sequence. The proposed adaptive NN-based equalizer is called “AdaNN”. The deployment of AdaNN includes both an offline training stage and an online training stage. Although a labeled training set is still required at the offline stage, no labeled data needs to be provided at the online stage. We collect recently received data using a sliding window, then fine-tune the parameters with the help of unlabeled data. Inspired by virtual adversarial training (VAT), which is a semi-supervised learning method, we propose a loss function named “Aug-VAT”, which outperforms naive decision-directed self-training and leads to a 4.5 times speedup. AdaNN is evaluated under two scenarios: a 56 Gb/s PAM4-modulated short-distance (100-m) VCSEL-MMF optical interconnect system, and a 32 Gbaud 16QAM-modulated Nyquist-WDM system (960-km SSMF). Experimental results indicate that the BER performance of AdaNN is much better than that of non-adaptive NN and MLSE. We conclude that, without a training sequence, it is possible to construct an adaptive NN-based equalizer with acceptable computational cost, which justifies the significance of our work.

The rest of this paper is organized as follows. Section II provides a detailed introduction to our proposed online training scheme. In Section III, the computational complexity of the proposed AdaNN is analyzed. In Section IV, the BER performance of AdaNN, non-adaptive NN, and MLSE is compared. Section V concludes the paper.

II AdaNN: Online Training based on Semi-supervised Learning

II-A Nonlinear equalizer based on NN

The NN we use contains an input layer, an output layer, and several hidden layers (each hidden layer contains $R$ neurons), as shown in Fig. 1(a)(b). The total number of layers in this NN is denoted as $l_{\textrm{NN}}$. For the $i$-th symbol, the relationship between adjacent fully-connected layers (denoted as layer $k$ and $k-1$, where $k\in\{1,...,l_{\textrm{NN}}-1\}$) follows

$$\boldsymbol{a}_{k}^{(i)}=\boldsymbol{W}_{k}\boldsymbol{h}_{k-1}^{(i)}+\boldsymbol{b}_{k}, \tag{1}$$

$$\boldsymbol{h}_{k}^{(i)}=\sigma(\boldsymbol{a}_{k}^{(i)}), \tag{2}$$

where $\boldsymbol{W}_{k}$ is the weight matrix ($R\times R$ between two hidden layers) and $\boldsymbol{b}_{k}$ is the bias vector for layer $k$. Function $\sigma(\cdot)$ stands for the activation function, with softmax chosen for the output layer and ReLU for hidden layers. Different activation functions are displayed in Fig. 1(c)(d).
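
To make the layer recursion concrete, the following is a minimal pure-Python sketch of the forward pass (affine map, ReLU in hidden layers, softmax at the output). The layer sizes follow the values reported in Section IV-A for the PAM-4 link ($\Gamma=4$, $L=5$, $R=10$, $l_{\textrm{NN}}=6$); $M=4$ (one class per PAM-4 level), the random initialization, and all function names are assumptions made for illustration.

```python
import math
import random

def relu(a):
    return [max(0.0, x) for x in a]

def softmax(a):
    m = max(a)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def affine(W, h, b):
    # a_k = W_k h_{k-1} + b_k, with W stored as a list of rows
    return [sum(w * x for w, x in zip(row, h)) + bk for row, bk in zip(W, b)]

def forward(layers, v):
    """layers: list of (W_k, b_k); v: input feature vector. Returns o."""
    h = v
    for k, (W, b) in enumerate(layers):
        a = affine(W, h, b)
        # softmax on the last layer, ReLU on every hidden layer
        h = softmax(a) if k == len(layers) - 1 else relu(a)
    return h

def random_layers(sizes, seed=0):
    # Hypothetical small random initialization, for illustration only.
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# 44 = Gamma * (2L + 1) inputs, four hidden layers of R = 10, M = 4 outputs (assumed).
layers = random_layers([44, 10, 10, 10, 10, 4])
o = forward(layers, [0.1] * 44)
```

The output `o` is a probability vector over the $M$ classes; the equalizer's decision is its arg-max.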

II-A1 Offline Training Stage

At the offline training stage, the loss function has the form of cross-entropy, which is widely used in multi-class classification [30]. Denote the total number of symbols contained in the sequence as $N_{seq}$. The training process can be formulated as

$$\min_{\{\boldsymbol{W}_{k},\boldsymbol{b}_{k}\}}L_{loss}=\min_{\{\boldsymbol{W}_{k},\boldsymbol{b}_{k}\}}\left(-\frac{1}{N_{seq}}\sum_{i=1}^{N_{seq}}\sum_{j=1}^{M}y_{j}^{(i)}\ln(o_{j}^{(i)})\right), \tag{3}$$

where $M$ is the number of classes (each symbol belongs to exactly one of the $M$ classes). The loss function $L_{loss}$ measures the difference between the predicted probability $\boldsymbol{o}^{(i)}$ and the ground truth $\boldsymbol{y}^{(i)}$. The whole training dataset is divided into batches, each containing a small portion of all $N_{seq}$ training samples. The network parameters are updated iteratively using a Stochastic Gradient Descent (SGD) optimizer with momentum, which is much faster than vanilla SGD [30].
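
As a small worked instance of the cross-entropy loss with one-hot labels, consider the sketch below. The function name `cross_entropy` and the two example output vectors are assumptions for illustration.

```python
import math

def cross_entropy(outputs, labels):
    """Mean cross-entropy: -(1/N) * sum_i sum_j y_j^(i) * ln(o_j^(i))."""
    total = 0.0
    for o, y in zip(outputs, labels):
        # only the entry selected by the one-hot label contributes
        total += sum(yj * math.log(oj) for oj, yj in zip(o, y) if yj > 0)
    return -total / len(outputs)

# Two symbols, M = 4 classes (example values, not measured data).
outputs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
loss = cross_entropy(outputs, labels)  # equals -(ln 0.7 + ln 0.25) / 2
```

A confident correct prediction (0.7) contributes little to the loss, while the uninformative uniform output contributes $\ln 4$.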

II-A2 Equalizing Process

During equalization, we denote the received signal sequence after interpolation and zero-mean normalization as $\hat{\boldsymbol{r}}=[\hat{\boldsymbol{r}}_{1},\hat{\boldsymbol{r}}_{2},...,\hat{\boldsymbol{r}}_{N_{seq}}]$, where vectors $\hat{\boldsymbol{r}}_{1},...,\hat{\boldsymbol{r}}_{N_{seq}}$ correspond to $N_{seq}$ received symbols (in chronological order). The feature vector $\boldsymbol{v}^{(i)}$ for the $i$-th symbol is constructed as

$$\boldsymbol{v}^{(i)}=[\hat{\boldsymbol{r}}_{i-L},...,\hat{\boldsymbol{r}}_{i},...,\hat{\boldsymbol{r}}_{i+L}]. \tag{4}$$

We denote the interpolation multiple as $\Gamma$; thus the dimension of the input feature vector $\boldsymbol{v}^{(i)}$ is $\Gamma(2L+1)$. The input-output relationship is given in Eq. (1)(2).
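
The feature-vector construction can be sketched in a few lines of Python. Each symbol is represented by its $\Gamma$ samples, and the vectors of the $2L+1$ symbols centred on symbol $i$ are concatenated. How symbols near the ends of the sequence are handled is not stated in the text, so this sketch simply rejects them; the function name and toy values are assumptions.

```python
def feature_vector(r_hat, i, L):
    """Concatenate the sample vectors of symbols i-L, ..., i+L (0-based i)."""
    if i - L < 0 or i + L >= len(r_hat):
        # Edge handling is an assumption: the sketch refuses incomplete windows.
        raise IndexError("symbol i needs L neighbours on each side")
    v = []
    for symbol_samples in r_hat[i - L:i + L + 1]:
        v.extend(symbol_samples)
    return v

# Toy sequence: 8 symbols, Gamma = 2 samples per symbol (placeholder values).
gamma = 2
r_hat = [[float(gamma * n + s) for s in range(gamma)] for n in range(8)]
v = feature_vector(r_hat, i=3, L=2)  # samples of symbols 1..5
```

With $\Gamma=2$ and $L=2$ the vector has $\Gamma(2L+1)=10$ entries, matching the dimension formula above.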

II-B Proposed AdaNN online training scheme

Suppose that we aim to classify symbols into $M$ classes correctly. For such a multi-class classification problem, data can be either “labeled” or “unlabeled”. The term “labeled” means that for an input vector $\boldsymbol{x}^{(i)}$, the true label $\boldsymbol{y}^{(i)}$ (which is a one-hot vector) is provided. “Unlabeled”, on the other hand, means that the exact classification result is not known.

During the online training stage, it is impossible to gather a large amount of labeled data. The transmitter may provide short training sequences for channel estimation/parameter fine-tuning. Unfortunately, short training sequences are not enough for training an NN. A possible solution is that, although the exact labels are not known, we can make use of the distribution of received signals to monitor the “drift away” process and use such information to fine-tune our equalizer. Here the concept of semi-supervised learning arises. Semi-supervised learning is a class of machine learning tasks that make use of unlabeled data for training (typically a small amount of labeled data with a large amount of unlabeled data). Unlabeled data helps us by providing information about the probability density distribution of input vectors [31, 32].

Based on the idea of semi-supervised learning, we now explain the process of AdaNN. We focus on the online training stage in this part. First, during the online stage a sliding window is utilized to collect data, as illustrated in Fig. 2. The window, containing $2L+1$ symbols and drawn as a bold line, slides along the received signal sequence. At each step $t$, $N_{b}$ input feature vectors $\boldsymbol{v}^{(1)},...,\boldsymbol{v}^{(N_{b})}$ are collected and serve as a batch. Gradient $\boldsymbol{g}_{t}$ is calculated based on the loss function, and the parameters are updated.

When no labels are provided, NN can work adaptively under decision-directed mode. However, if conventional cross-entropy loss function is used, the following problems occur:

(1) When using vanilla cross-entropy loss function, the convergence speed is very slow.

(2) In communication systems, signals are inevitably distorted by different levels of noise. Without data augmentation, the NN is not robust against noise.

We’ve verified experimentally that both VAT and data augmentation can accelerate the training process greatly. Therefore, we propose a loss function named “Augmented Virtual Adversarial Training”, or “Aug-VAT” for short. Aug-VAT combines $\Pi$-model [33] and VAT [34], considering that the loss function should be consistent with the communication scenario. Fig. 3 shows the general structure of AdaNN with Aug-VAT. The detailed algorithm is given as follows.

$\Pi$-model encourages consistent NN output between two realizations of one input vector, under two different data augmentation conditions. Denote $g_{\sigma}(\boldsymbol{v})$ as the input augmentation function. The augmentation is done by generating a random vector $\boldsymbol{\eta}$ and adding it to $\boldsymbol{v}$:

$$g_{\sigma}(\boldsymbol{v})=\boldsymbol{v}+\boldsymbol{\eta},\quad\boldsymbol{\eta}\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{\Gamma(2L+1)}). \tag{5}$$

When using Aug-VAT as the loss function in AdaNN, every single input feature vector $\boldsymbol{v}$ should first be replaced by $g_{\sigma}(\boldsymbol{v})$, which then serves as the input feature vector in VAT. VAT is closely related to adversarial training [35]. The adversarial perturbation for the $i$-th input vector can be defined as

$$\boldsymbol{r}_{\textrm{adv}}^{(i)}=\arg\max_{\boldsymbol{r};\,\|\boldsymbol{r}\|\leqslant\epsilon}\left\{-\sum_{j=1}^{M}y_{j}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}+\boldsymbol{r}))\right)_{j}\right]\right\}. \tag{6}$$

This equation defines $\boldsymbol{r}^{(i)}_{\textrm{adv}}$ as the small perturbation (satisfying $\|\boldsymbol{r}^{(i)}_{\textrm{adv}}\|\leqslant\epsilon$) that, when added to $\boldsymbol{v}^{(i)}$, increases the loss function calculated on the perturbed input the most. “Adversarial training” means that during training the loss function is always calculated based on the perturbed input vectors rather than the clean ones, so that the NN’s robustness can be improved. When full label information $\boldsymbol{y}^{(i)}$ is not available, $\boldsymbol{r}_{\textrm{adv}}$ can only be approximated by computing $\boldsymbol{r}_{\textrm{vadv}}$, which is derived efficiently using a one-time power iteration method (see Algorithm 1).
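
The one-step power iteration can be illustrated as follows. This sketch follows the standard VAT recipe from the literature [34] (random unit direction, one gradient of the KL divergence between clean and perturbed predictions, rescaling to norm $\epsilon$) and is not a transcription of Algorithm 1, which is not reproduced here. To keep the gradient analytic, the classifier is a hypothetical single linear layer with softmax; the weights, the step size `xi`, and all names are assumptions.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def predict(W, b, x):
    # Stand-in classifier (assumption): one linear layer followed by softmax.
    return softmax([sum(w * xn for w, xn in zip(row, x)) + bj
                    for row, bj in zip(W, b)])

def kl_grad_wrt_perturbation(W, b, x, r):
    # Gradient of KL(p(x) || p(x + r)) with respect to r.
    # For this linear-softmax model it equals W^T (q - p).
    p = predict(W, b, x)
    q = predict(W, b, [xn + rn for xn, rn in zip(x, r)])
    return [sum(W[j][n] * (q[j] - p[j]) for j in range(len(W)))
            for n in range(len(x))]

def unit(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def virtual_adversarial_perturbation(W, b, x, eps, xi=1e-2, seed=0):
    rng = random.Random(seed)
    d = unit([rng.gauss(0.0, 1.0) for _ in x])                  # random unit direction
    g = kl_grad_wrt_perturbation(W, b, x, [xi * c for c in d])  # one power iteration
    return [eps * c for c in unit(g)]                           # rescale to norm eps

# Toy model and input (placeholder values).
W = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1], [0.4, 0.4, -0.9]]
b = [0.0, 0.1, -0.1]
x = [0.5, -1.0, 0.25]
r_vadv = virtual_adversarial_perturbation(W, b, x, eps=0.3)
```

No label is needed at any point: the clean prediction itself plays the role of the target, which is what makes the perturbation "virtual". For a multi-layer NN the analytic gradient would be replaced by one back-propagation with respect to the input.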

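The complete form of the Aug-VAT loss function for a single batch can be formulated as

$$L_{loss}=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\sum_{j=1}^{M}\hat{y}_{j}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}))\right)_{j}\right]=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\hat{y}_{l_{\textrm{adv}}}^{(i)}\ln\left[\left(\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}))\right)_{l_{\textrm{adv}}}\right], \tag{7}$$

where index $l_{\textrm{adv}}$ means that after adding adversarial perturbation and noise, the $i$-th symbol is classified into class $l_{\textrm{adv}}$:

$$l_{\textrm{adv}}=\arg\max_{k}\left[\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}+\boldsymbol{r}_{\textrm{vadv}}^{(i)}))\right]_{k}. \tag{8}$$

The final pseudocode of our proposed AdaNN online training scheme (with Aug-VAT as loss function) is given in Algorithm 2. The flow chart of AdaNN is displayed in Fig. 4. At step $t$, the gradient $\boldsymbol{g}_{t}$ is accumulated until all the data in batch $B_{t}$ have been used. After that, the parameters $\boldsymbol{\theta}_{t}$ are updated using a gradient-based optimizer. All gradient-based optimization algorithms can be written in the following general form [36]:

$$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\frac{\alpha_{t}}{\psi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t})}\phi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t}), \tag{9}$$

where $\boldsymbol{g}_{t}$ represents the gradient obtained at the $t$-th time step, $\alpha_{t}/\psi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t})$ denotes the adaptive learning rate, and $\phi(\boldsymbol{g}_{1},...,\boldsymbol{g}_{t})$ is the gradient estimation. Several influential optimizers include: SGD [37], Momentum SGD [38], Nesterov Momentum [39], AdaGrad [40], RMSprop [30], and Adam [41]. Choosing the right optimizer has a great impact on AdaNN’s performance. Experimental results show that Adam performs best for our task.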
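
As an illustration of how one optimizer fits this general form, the sketch below writes a single Adam step with $\phi$ as the bias-corrected first-moment estimate and $\psi$ as the square root of the bias-corrected second-moment estimate plus a small constant. This is one common way to map Adam onto the general form, shown under that assumption; the hyper-parameters are the values quoted for online training in Section IV-A, while the toy objective and the names are illustrative.

```python
import math

def adam_step(theta, grad, state, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the moment estimates and the step counter."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for n, (p, g) in enumerate(zip(theta, grad)):
        state["m"][n] = beta1 * state["m"][n] + (1 - beta1) * g
        state["v"][n] = beta2 * state["v"][n] + (1 - beta2) * g * g
        phi = state["m"][n] / (1 - beta1 ** t)                    # gradient estimate
        psi = math.sqrt(state["v"][n] / (1 - beta2 ** t)) + eps   # scaling term
        new_theta.append(p - alpha / psi * phi)
    return new_theta

# Toy objective f(theta) = sum(theta_n^2), whose gradient is 2 * theta (assumption).
theta = [1.0, -2.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
for _ in range(500):
    theta = adam_step(theta, [2.0 * p for p in theta], state)
```

Plain SGD corresponds to the special case $\phi=\boldsymbol{g}_{t}$ and $\psi=1$.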

II-C Other choices for loss function

When labels are not provided, $y_{j}^{(i)}$ in Eq. (3) should be replaced with the pseudo-label $\hat{y}_{j}^{(i)}$. The loss function still has the cross-entropy form

$$L_{loss}=-\frac{1}{N_{b}}\sum_{i=1}^{N_{b}}\sum_{j=1}^{M}\hat{y}_{j}^{(i)}\ln(o_{j}^{(i)}). \tag{10}$$

The pseudo-label $\hat{y}_{j}^{(i)}$ can be obtained in different ways, corresponding to different loss functions. In the last subsection, we proposed Aug-VAT as the loss function, which combines $\Pi$-model with VAT. Besides, vanilla self-training, $\Pi$-model and VAT can each be used alone as the loss function.

(1) Self-training: For the $i$-th input feature vector $\boldsymbol{v}^{(i)}$, the output probability vector is $\boldsymbol{o}^{(i)}=\textrm{NN}_{\boldsymbol{\theta}}(\boldsymbol{v}^{(i)})$. The pseudo-label $\hat{\boldsymbol{y}}^{(i)}$ can be derived by

$$\hat{y}_{j}^{(i)}=\begin{cases}1, & \textrm{if } j=\arg\max_{k}(o_{k}^{(i)}),\\ 0, & \textrm{otherwise}.\end{cases} \tag{11}$$

Self-training is similar to the decision-directed mode of conventional adaptive equalizers, and thus serves as a baseline.

(2) $\Pi$-model only: The main difference between $\Pi$-model and self-training lies in data augmentation. The output probability vector is $\boldsymbol{o}^{(i)}=\textrm{NN}_{\boldsymbol{\theta}}(g_{\sigma}(\boldsymbol{v}^{(i)}))$, where $g_{\sigma}(\cdot)$ follows Eq. (5). The derivation of $\hat{\boldsymbol{y}}^{(i)}$ is the same as Eq. (11).

(3) Virtual adversarial training only: When using vanilla VAT, the output probability vector is $\boldsymbol{o}_{\textrm{adv}}^{(i)}=\textrm{NN}_{\boldsymbol{\theta}}(\boldsymbol{v}^{(i)}+\boldsymbol{r}_{\textrm{vadv}}^{(i)})$, where $\boldsymbol{r}_{\textrm{vadv}}^{(i)}$ is the virtual adversarial perturbation vector calculated from Algorithm 1. The derivation of $\hat{\boldsymbol{y}}^{(i)}$ is the same as Eq. (11).
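
The pseudo-label rule and the resulting loss can be sketched as below. The function takes the network outputs as plain lists, so the same code covers self-training (labels decided from the same outputs that enter the loss) and, as an assumed reading of the Aug-VAT loss, the case where labels are decided from the outputs on perturbed inputs. The function names and example vectors are illustrative.

```python
import math

def pseudo_label(o):
    """One-hot pseudo-label with a 1 at the arg-max of the output vector."""
    k = max(range(len(o)), key=lambda j: o[j])
    return [1 if j == k else 0 for j in range(len(o))]

def pseudo_label_loss(outputs, label_outputs=None):
    """Cross-entropy of `outputs` against pseudo-labels.

    label_outputs: outputs used to decide the pseudo-labels;
    defaults to `outputs` itself (self-training).
    """
    if label_outputs is None:
        label_outputs = outputs
    total = 0.0
    for o, lo in zip(outputs, label_outputs):
        y_hat = pseudo_label(lo)
        total += sum(yj * math.log(oj) for oj, yj in zip(o, y_hat) if yj)
    return -total / len(outputs)

# Example network outputs for a batch of two symbols (placeholder values).
batch = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]
self_training = pseudo_label_loss(batch)

# Outputs on perturbed inputs (placeholder values): the first decision flips.
perturbed = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]
perturbed_labels = pseudo_label_loss(batch, perturbed)
```

When the perturbation flips a decision, the loss grows sharply for that symbol, which pushes the network to make its output consistent near the decision boundary.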

All these loss functions are compared in Section IV. There are several other loss functions we haven’t covered. “Temporal ensemble” [33] requires re-evaluation of all training samples each time the NN parameters are updated, which is too costly. “Mean teacher” [42] constructs an ensemble using the current model and several past models during training. Our experiments show that “Mean teacher” performs no differently from self-training. We also know that many semi-supervised learning algorithms are based on the “low dimension manifold assumption”, which assumes that data lie on a manifold of much lower dimension than the input space. Relevant algorithms include low dimension manifold model (LDMM) and curvature regularization (CURE) [43, 44]. However, the estimation of local dimension/curvature requires access to all data points in a small area, which cannot be guaranteed.

III Computational Complexity

In this section, we focus on analyzing the computational cost of AdaNN. Fully-connected NNs mainly involve two types of computations: multiplications and activation functions. Note that ReLU rather than $\tanh(\cdot)$ is used as the activation function here; therefore activation doesn’t need to be considered when analyzing complexity. For a non-adaptive deep neural network, the calculation of the output probability vector $\boldsymbol{o}$ is called “forward propagation”, which follows Eq. (1)(2). When equalizing a single symbol, each layer can be viewed as a vector. The numbers of neurons contained in the $l_{\textrm{NN}}$ layers are: $\Gamma(2L+1)$, $R$, $R$, …, $R$, and $M$. The number of floating-point multiplications $k_{\textrm{NN}}$ can be calculated as

$$k_{\textrm{NN}}=\Gamma(2L+1)\cdot R+(l_{\textrm{NN}}-3)\cdot R^{2}+R\cdot M. \tag{12}$$

As for AdaNN, all the parameters need to be adjusted online. According to Appendix A, for a single back-propagation, the number of required floating-point multiplications $k_{\textrm{back}}$ can be calculated as

$$k_{\textrm{back}}=2k_{\textrm{NN}}+(l_{\textrm{NN}}-2)\cdot R+M\approx 2k_{\textrm{NN}}. \tag{13}$$

The computational cost of back-propagation is slightly larger than two times the cost of forward propagation.

When $\Pi$-model serves as the loss function, two forward propagations and one back-propagation are needed in a single iteration. Thus the computational cost of AdaNN ($\Pi$-model as loss function) should be approximately 4 times the cost of non-adaptive NN. When Aug-VAT serves as the loss function, two forward propagations, one back-propagation and the computation of $\boldsymbol{r}_{\textrm{vadv}}$ (which mainly includes one forward propagation and one back-propagation) are needed in a single iteration. In total, the computational cost of AdaNN (Aug-VAT as loss function) should be approximately 7 times the cost of non-adaptive NN. In contrast, the computational cost of AdaNN (self-training as loss function) is 3 times the cost of non-adaptive NN.
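
The cost multiples quoted above can be checked numerically. The sketch below evaluates the multiplication counts for the network size reported for the PAM-4 experiment ($\Gamma=4$, $L=5$, $R=10$, $l_{\textrm{NN}}=6$); $M=4$ is an assumption (one class per PAM-4 level), and the count of one forward pass plus one back-propagation for self-training is inferred from the stated factor of 3.

```python
def k_forward(gamma, L, R, l_nn, M):
    """Floating-point multiplications in one forward propagation (k_NN)."""
    return gamma * (2 * L + 1) * R + (l_nn - 3) * R * R + R * M

def k_backward(gamma, L, R, l_nn, M):
    """Floating-point multiplications in one back-propagation (k_back)."""
    return 2 * k_forward(gamma, L, R, l_nn, M) + (l_nn - 2) * R + M

# Gamma, L, R and l_NN are the reported values; M = 4 is an assumption.
kf = k_forward(4, 5, 10, 6, 4)
kb = k_backward(4, 5, 10, 6, 4)

# (forward propagations, back-propagations) per iteration, as counted in the text.
passes = {"self-training": (1, 1), "Pi-model": (2, 1), "Aug-VAT": (3, 2)}
cost = {name: (f * kf + b * kb) / kf for name, (f, b) in passes.items()}
```

For these sizes `kf` is 780 and `kb` is 1604, so the relative costs come out slightly above 3, 4 and 7, in line with the approximation $k_{\textrm{back}}\approx 2k_{\textrm{NN}}$.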

IV Experimental Results

In order to justify AdaNN’s wide applicability, we’ve conducted experiments under two scenarios: a 56 Gb/s PAM4-modulated VCSEL-MMF optical link (100-m), and a 32 Gbaud 16QAM-modulated Nyquist-WDM system (960-km SSMF). The results are analyzed in this section.

IV-A Different Loss Functions

We first present the adaptive training results for all the $4$ different loss functions. In this subsection experiments are conducted with a 56 Gb/s PAM4-modulated VCSEL optical link, which is depicted in Fig. 5. The system mainly consists of a directly modulated 850-nm VCSEL, 100-m OM4 MMF, and a photodiode (PD). The received signal is sampled using a high-speed real-time digital signal oscilloscope (DSO). The 850-nm VCSEL is a New Focus® 1784, while the PD is a New Focus® 1484-A-50. The OM4 MMF is YOFC® MaxBand® OM4 multimode fiber. The DSO is an Agilent DSAX96204Q, with a sampling rate of $160$ GSa/s. We first 4x resample the received signal as stated in [45] ($\Gamma=4$ for all experiments). The signal is then normalized, and input feature vectors are constructed.

We’ve generated two sets of PAM-4 symbols with a Bit-pattern Generator (BPG), SHF 12104A (56 Gb/s). Following [46], we did not use a PRBS pattern. Instead, a binary sequence is first generated by applying the $\textrm{sign}(\cdot)$ function to a Gaussian noise sequence generated in MATLAB, then converted into two PAM-4 sequences. Each of the two datasets (denoted as set1 and set2) contains $2^{20}$ PAM-4 symbols. Set2 was collected $56$ hours after set1. For both set1 and set2, the received optical power (ROP) is $-2.7$ dBm. Between these two experiments, we rebuilt the experimental system and adjusted the position of the optical fiber, in order to simulate a realistic scenario where fiber properties change slightly.

Our NN-based equalizer with $4$ hidden layers ($l_{\textrm{NN}}=6$, $R=10$) is first trained offline using $25\%$ of the data in set1. The tap number of the NN is first optimized by testing $L\in\{1,3,5,7,9,11\}$, and is then fixed as $L=5$ since it achieves satisfying BER performance. The batch size is fixed as $N_{b}=16384$ for offline training. Momentum SGD is used, with initial learning rate $\alpha=0.004$ and moving average decay $\beta=0.9$. The model is trained for $200$ epochs (an epoch represents a single pass through the entire training set, meaning that every feature vector in the training set has been used exactly once). This ensures good convergence and a BER lower than $10^{-3}$.

During the online stage, a sliding window is utilized to collect data, as Fig. 2 shows. Set1 and set2 are concatenated and then processed sequentially. The batch size for online training is $N_{b}=8192$. The Adam optimizer is used, with initial learning rate $\alpha=0.01$, $\beta_{1}=0.9$, and $\beta_{2}=0.999$. When processing set1 and set2 sequentially, the BER for set1 remains relatively low, while for set2 the BER increases abruptly. The online training scheme is then expected to bring the BER back down to a low level. The BER curves obtained are smoothed by averaging the BER values of $8$ neighboring batches. We mainly focus on two quantities:

(1) Convergence time, defined as the number of batches it takes before AdaNN satisfies two conditions: reaching a BER lower than $10^{-3}$ on the most recent $8$ batches, and reaching an overall BER lower than $10^{-3}$ on set2.

(2) Final BER, defined as AdaNN’s BER on set2 at the end of the online training stage.

The convergence time as well as the final BER are summarized in Table I. The numbers in bold represent the best performance within one class of training method (only hyper-parameters are changed). From Table I we can tell that, while self-training suffers from slow convergence, AdaNN can be $4.5$ times faster ($72\div 16=4.5$), which indicates AdaNN’s effective usage of unlabeled data. The final BER values on set2 are also displayed in order to show AdaNN’s good generalization ability. In the following experiments, AdaNN’s loss function is fixed as Aug-VAT ($\sigma=0.15$, $\epsilon=0.3$). We’ve also calculated the change of the weight matrices before and after the online training stage. The results are given in Appendix B.

IV-B 100-m VCSEL-MMF link

In this subsection, AdaNN is evaluated in the 100-m VCSEL-MMF optical link described above. During the online stage, set1 and set2, each containing $128$ batches (batch size $N_{b}=8192$), are concatenated and processed sequentially. The BER curve of AdaNN is compared with those of multiple equalizers, including non-adaptive NN, NN with training sequence, and MLSE.

IV-B1 Compare: Non-Adaptive NN

The BER performance of AdaNN as well as a non-adaptive NN is displayed in Fig. 6. The network structure of the NN is the same as that of AdaNN. Before online equalization, both AdaNN and the NN are trained offline using $25\%$ of the data in set1 (received optical power $-2.7$ dBm). When processing set2, the BER of the NN rises abruptly to about $1.6\times 10^{-2}$, and remains unchanged since it’s non-adaptive. While AdaNN’s BER also rises when first encountering set2, the BER soon drops below $10^{-3}$. Note that it only takes about $40$ batches before the BER stabilizes again.

We’ve also tested another AdaNN model, which is initially trained offline using a different dataset. The ROP of the new dataset is $-4.7$ dBm. Surprisingly, compared with AdaNN (trained@$-2.7$ dBm), AdaNN (trained@$-4.7$ dBm) can achieve very similar BER. This indicates that the adaptive training process of AdaNN is robust even when a different offline-trained model is used.

IV-B2 Compare: NN with training sequence

We have already demonstrated that AdaNN can adjust its parameters without the help of labels. It is still necessary to investigate the BER performance of a normal NN when a short training sequence can be provided. In this part, training sequences are provided at the beginning of set1 and set2 respectively. Concretely, the NN’s BER on a single batch is measured immediately after the NN has been trained on this batch. By minimizing the cross-entropy loss function, the NN is trained using the training sequences for $100$ iterations, ensuring convergence. Denote the ratio of training sequence length to set1 length (or set2 length) as $\gamma$ . Fig. 7 shows the BER of the NN-based equalizer, fine-tuned with the provided training sequence. If $\gamma>1/32$ (i.e., the training sequence contains more than $32768$ symbols), the performance is similar to that of AdaNN. For smaller $\gamma$ , the performance degradation becomes unacceptable. Our results show that AdaNN can achieve better (or at least similar) BER performance compared with the NN-based equalizer, even when a portion of the labels is provided to that NN. There are cases in which sending an extra training sequence cannot be supported. AdaNN provides an effective alternative for these occasions.
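The relation between $\gamma$ and the training sequence length can be checked with a short calculation. The sketch below assumes that each set holds $128$ batches of $N_{b}=8192$ symbols, as stated at the beginning of this subsection; the function name `training_symbols` and the list of $\gamma$ values are illustrative and are not taken from Fig. 7.

```python
from fractions import Fraction

N_B = 8192
BATCHES_PER_SET = 128
SET_LEN = N_B * BATCHES_PER_SET  # symbols per set

def training_symbols(gamma):
    """Number of labeled symbols for a given ratio gamma (hypothetical helper)."""
    return int(gamma * SET_LEN)

# Training sequence length at the threshold gamma = 1/32 quoted in the text.
threshold_symbols = training_symbols(Fraction(1, 32))
threshold_batches = threshold_symbols // N_B

# Illustrative sweep; these gamma values are assumptions, not the ones used in the paper.
for gamma in [Fraction(1, 8), Fraction(1, 32), Fraction(1, 128)]:
    print(f"gamma = {gamma}: {training_symbols(gamma)} labeled symbols")
```

With these numbers, $\gamma=1/32$ corresponds to $32768$ labeled symbols, or four full batches, per set.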

IV-B3 Compare: Conventional MLSE

We have also compared AdaNN with conventional MLSE. The memory length of MLSE takes its value from $l_{ch}\in\{1,3,5,7\}$ . The channel response coefficients are estimated using the least mean square (LMS) algorithm. Note that true labels are provided when updating these coefficients, which means that MLSE works adaptively in a supervised manner. The update frequency of the channel response coefficients is exactly the same as the update frequency of the AdaNN parameters. The BER performance of both AdaNN and MLSE is displayed in Fig. 8. Once AdaNN converges, it has a much lower BER ( $3.0\times 10^{-4}$ ) than MLSE ( $2.4\times 10^{-3}$ ). The results show that AdaNN’s generalization ability is stronger than that of adaptive conventional algorithms.
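For readers unfamiliar with supervised LMS channel estimation, a minimal sketch is given below. It covers only the coefficient update (the Viterbi search of MLSE is omitted), and it is not the implementation used in the experiments: the step size `mu`, the three-tap channel `h_true`, the noise-free link, and the choice of one tap per unit of memory length are all assumptions made for illustration.

```python
import numpy as np

def lms_channel_estimate(tx, rx, l_ch, mu=0.01):
    """Estimate an l_ch-tap FIR channel from known symbols tx and received samples rx."""
    h = np.zeros(l_ch)
    for n in range(l_ch - 1, len(tx)):
        # Most recent symbol first: tx[n], tx[n-1], ..., tx[n-l_ch+1]
        window = tx[n - l_ch + 1:n + 1][::-1]
        err = rx[n] - h @ window   # prediction error given the current estimate
        h += mu * err * window     # LMS update, driven by the true labels
    return h

rng = np.random.default_rng(0)
pam4_levels = np.array([-3.0, -1.0, 1.0, 3.0])
tx = rng.choice(pam4_levels, size=20000)   # known (labeled) PAM4 symbols
h_true = np.array([0.8, 0.3, -0.1])        # hypothetical channel response
rx = np.convolve(tx, h_true)[:len(tx)]     # noise-free received samples

h_est = lms_channel_estimate(tx, rx, l_ch=3)
```

In this noise-free toy setting the estimate converges to `h_true`. On a real link the residual error depends on the noise level, on `mu`, and on how well an `l_ch`-tap linear model describes the channel.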

IV-B4 Discuss: Influence of Batch Size

In the previous paragraphs, the batch size is $N_{b}=8192$ . Choosing a different $N_{b}$ influences the online training process. Concretely, $N_{b}$ describes how much data needs to be collected before AdaNN updates its parameters. A smaller $N_{b}$ seems beneficial since the model parameters are updated more frequently. Unfortunately, a very small $N_{b}$ causes a new problem: each batch then contains too little data to reflect the overall probability distribution. Fig. 9 provides the BER performance when $N_{b}$ takes different values, ranging from $128$ to $32768$ . As can be seen from the case $N_{b}=128$ , AdaNN may fail to converge when trained on very small batches. When building an actual system, $N_{b}$ should be chosen carefully, depending on how frequently the link properties change.
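The remark that a very small batch does not reflect the overall probability distribution can be illustrated numerically. The sketch below is a toy experiment, not a simulation of AdaNN: it assumes i.i.d. uniformly distributed PAM4 symbols and measures how far the empirical symbol frequencies in one batch deviate from the ideal value of $1/4$. The function name and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_deviation(batch_size, n_trials=200, n_levels=4):
    """Average, over n_trials batches, of the largest gap between the
    empirical symbol frequency and the ideal value 1/n_levels."""
    devs = []
    for _ in range(n_trials):
        symbols = rng.integers(0, n_levels, size=batch_size)
        freqs = np.bincount(symbols, minlength=n_levels) / batch_size
        devs.append(np.max(np.abs(freqs - 1.0 / n_levels)))
    return float(np.mean(devs))

dev_small = mean_max_deviation(128)    # N_b = 128
dev_large = mean_max_deviation(8192)   # N_b = 8192
print(f"N_b = 128:  {dev_small:.4f}")
print(f"N_b = 8192: {dev_large:.4f}")
```

Since the sampling error of a frequency estimate scales roughly as $1/\sqrt{N_{b}}$ , the deviation for $N_{b}=128$ is expected to be several times larger than for $N_{b}=8192$ . This covers only one aspect of the problem; the statistics of the received waveform are affected in a similar way.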

IV-C Long-distance optical transmission

In fact, AdaNN can be used in many communication systems. In this subsection, in order to show its wide applicability, AdaNN is evaluated with a 32 Gbaud 16QAM-modulated Nyquist-WDM system, which is depicted in Fig. 10. An arbitrary waveform generator operating at 64 GSa/s generates 32 Gbaud 16QAM baseband signals. A root-raised-cosine (RRC) filter with a roll-off factor of $0.1$ is chosen for Nyquist pulse shaping. At the transmitter, we use an external cavity laser (ECL) with a narrow linewidth of $25$ kHz. The transmission link consists of 12 spans of 80 km SSMF with erbium-doped fiber amplifier (EDFA)-only amplification. At the receiver, an optical band pass filter (OBPF) with 45 GHz bandwidth is used as the receiving filter. The coherent receiver consists of an optical local oscillator (LO) with 25 kHz linewidth, an optical hybrid, and balanced detectors (BD). A real-time oscilloscope operating at 80 GSa/s stores the received signal. The offline DSP has several stages. First, an FIR filter roughly compensates for the accumulated dispersion, and then carrier frequency recovery is conducted. After synchronization, carrier phase recovery is conducted, and finally AdaNN is used to mitigate nonlinear distortions.

Several different sets of 16QAM symbols are collected, each containing $50400$ symbols. Two sets are chosen (denoted as set1 and set2) and concatenated. For set1, the ROP is [math] dBm, while for set2 the ROP is $-1$ dBm. The polarization of set2 is different from that of set1. The received signals are 4x resampled ( $\Gamma=4$ ). AdaNN with $4$ hidden layers ( $l_{\textrm{NN}}=6$ , $R=10$ ) is first trained offline using $50\%$ of the data in set1. During the online stage, the batch size is $N_{b}=1260$ . When set1 and set2 are processed sequentially, the BER performance on different batches is displayed in Fig. 11. Again, AdaNN shows adaptivity and performs better than the non-adaptive NN.
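The figures quoted for this experiment can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this subsection; the variable names are illustrative, and the batch count assumes that every symbol of both sets is processed during the online stage.

```python
# Link and sampling parameters stated in the text.
SPANS, SPAN_KM = 12, 80
AWG_RATE_GSA, BAUD_G = 64, 32
SYMBOLS_PER_SET, N_B = 50400, 1260

link_km = SPANS * SPAN_KM                        # total SSMF length in km
tx_samples_per_symbol = AWG_RATE_GSA // BAUD_G   # transmitter oversampling factor
batches_per_set = SYMBOLS_PER_SET // N_B         # online batches per set
total_batches = 2 * batches_per_set              # set1 followed by set2

print(link_km, tx_samples_per_symbol, batches_per_set, total_batches)
```

The total length of $960$ km agrees with the value quoted in the Conclusion, and each set splits into exactly $40$ batches of $1260$ symbols.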

V Conclusion

In this paper, we propose an adaptive online training scheme, which can be used to fine-tune an NN-based equalizer without the help of a training sequence. The proposed adaptive NN-based equalizer is called “AdaNN”. At the online stage, recently received data are collected using a sliding window. With the help of unlabeled data, all the parameters in our NN are fine-tuned in an unsupervised manner, which is similar to decision-directed adaptive equalization. The performance of AdaNN is evaluated under two scenarios: a 56 Gb/s PAM4-modulated VCSEL optical link, and a 32 Gbaud 16QAM-modulated optical transmission system (960-km SSMF). Heterogeneous datasets are concatenated to test AdaNN’s adaptivity. Our experimental results indicate that AdaNN improves the BER performance compared with both non-adaptive NN-based equalizers and conventional MLSE. Compared with self-training, which serves as a baseline, AdaNN’s convergence can be 4.5 times faster. The online training process has proved robust when different offline-trained models are used, which shows AdaNN’s wide applicability. The computational complexity of the AdaNN training scheme is also analyzed theoretically. We conclude that it is feasible to construct an adaptive NN-based equalizer with acceptable computational cost when training sequences are not provided. The generalization ability of NN-based equalizers can be greatly improved using our proposed method.

Appendix A The complexity of back-propagation

For AdaNN, all the parameters (including both weights and biases) need to be adjusted online, based on the gradients $\nabla_{\boldsymbol{W}_{k}}L_{loss}$ and $\nabla_{\boldsymbol{b}_{k}}L_{loss}$ . According to [30], all these gradients are calculated using the back-propagation algorithm, which is given in Algorithm 3.

Consider the back-propagation from layer $k$ to layer $k-1$ . The number of neurons contained in each layer can be denoted as $R_{k}$ and $R_{k-1}$ . It’s straightforward to see that, during the back-propagation from layer $k$ to layer $k-1$ , $\boldsymbol{g}\leftarrow\boldsymbol{g}\odot\sigma^{\prime}(\boldsymbol{a}_{k})$ requires $R_{k}$ multiplications, $\nabla_{\boldsymbol{W}_{k}}L_{loss}=\boldsymbol{g}\boldsymbol{h}_{k-1}^{\top}$ requires $R_{k-1}\cdot R_{k}$ multiplications, and $\boldsymbol{g}\leftarrow\boldsymbol{W}_{k}^{\top}\boldsymbol{g}$ requires $R_{k-1}\cdot R_{k}$ multiplications. By summing over all layers, we can conclude that for a single back-propagation, the number of required floating-point multiplications $k_{\textrm{back}}$ can be calculated as

[TABLE]
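The per-layer counts listed above can be tallied with a short script. This is a sketch under stated assumptions: every layer transition from $k-1$ to $k$ is charged $R_{k}+2R_{k-1}R_{k}$ multiplications, exactly as enumerated in the preceding paragraph, and the layer sizes in the example are hypothetical. The summation limits of the closed-form expression in the paper may treat the first or last layer differently.

```python
def backprop_multiplications(layer_sizes):
    """Count multiplications for one back-propagation pass.

    layer_sizes[k] is the number of neurons R_k, with the input layer first.
    """
    total = 0
    for r_prev, r_k in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += r_k            # g <- g * sigma'(a_k), element-wise
        total += r_prev * r_k   # grad_W_k = g h_{k-1}^T, outer product
        total += r_prev * r_k   # g <- W_k^T g, matrix-vector product
    return total

# Hypothetical 6-layer network: input size 21, four hidden layers of 10, output size 4.
sizes = [21, 10, 10, 10, 10, 4]
k_back = backprop_multiplications(sizes)
print(k_back)
```

For this example the count is dominated by the two $R_{k-1}\cdot R_{k}$ terms; the element-wise products contribute only $44$ of the $1144$ multiplications.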

Appendix B Weight changes

It is useful to know how the online training stage changes the NN parameters. We have compared the weight matrices in two models: the initial model trained offline, and the final model which has been trained online with set1 and set2. Our NN model has $l_{\textrm{NN}}=6$ layers: $1$ input layer, $4$ hidden layers, and $1$ output layer. There are $5$ weight matrices between these $6$ layers, denoted as $\boldsymbol{W}_{1},...,\boldsymbol{W}_{5}$ .

Define function $S(\boldsymbol{W})$ as summing up the absolute value of all elements in matrix $\boldsymbol{W}$ :

$S(\boldsymbol{W})=\sum_{i}\sum_{j}\left|W_{ij}\right|$

We now calculate the following ratio for all layers ( $k=1,2,...,5$ ):

[TABLE]

The ratio $r_{k}$ reflects the change of the weight matrix $\boldsymbol{W}^{\textrm{init}}_{k}$ . The results are given in Table II.

It can be concluded that only minor changes have occurred during the online training stage. On the other hand, it can be observed that $r_{k}$ becomes larger for weight matrices near the input layer.
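A possible way to compute such a ratio is sketched below. The function `abs_sum` follows the verbal definition of $S(\boldsymbol{W})$ given above. The exact form of $r_{k}$ is not recoverable from the text here, so the sketch assumes $r_{k}=S(\boldsymbol{W}^{\textrm{final}}_{k}-\boldsymbol{W}^{\textrm{init}}_{k})/S(\boldsymbol{W}^{\textrm{init}}_{k})$ ; this form, the name $\boldsymbol{W}^{\textrm{final}}_{k}$ , and the toy matrices are assumptions for illustration only.

```python
import numpy as np

def abs_sum(W):
    """S(W): sum of the absolute values of all elements of W."""
    return float(np.sum(np.abs(W)))

def change_ratio(W_init, W_final):
    """Assumed definition: r = S(W_final - W_init) / S(W_init)."""
    return abs_sum(W_final - W_init) / abs_sum(W_init)

# Toy matrices, not taken from the trained models.
W_init = np.array([[1.0, -2.0],
                   [3.0, -4.0]])
W_final = W_init + np.array([[0.1, 0.0],
                             [0.0, -0.1]])

r = change_ratio(W_init, W_final)
print(f"S(W_init) = {abs_sum(W_init)}, r = {r:.3f}")
```

Under this assumed definition, a small $r_{k}$ means that the total absolute change is small relative to the overall magnitude of the offline-trained weights.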

Bibliography


[1] D. Kuchta, A. Rylyakov, F. Doany, C. Schow, J. Proesel, C. Baks, and A. Larsson, “A 71 Gb/s NRZ modulated 850-nm VCSEL-based optical link,” IEEE Photon. Technol. Lett., vol. 27, no. 6, pp. 577-580, Jan. 2015.
[2] D. Kuchta, “Higher speed VCSEL links using equalization,” presented at the European Conf. on Optical Communication, Düsseldorf, Germany, 2016, Paper Tu.1.A.2.
[3] Z. Tan, C. Yang, Y. Zhu, Z. Xu, K. Zou, F. Zhang, and Z. Wang, “High speed band-limited 850-nm VCSEL link based on time-domain interference elimination,” IEEE Photon. Technol. Lett., vol. 29, no. 9, pp. 751-754, May 2017.
[4] F. Karinou et al., “Experimental performance evaluation of equalization techniques for 56 Gb/s PAM-4 VCSEL-based optical interconnects,” presented at the European Conf. on Optical Communication, Valencia, Spain, 2015, Paper P.4.10.
[5] T. Lengyel, K. Szczerba, P. Westbergh, M. Karlsson, A. Larsson, and P. Andrekson, “Sensitivity improvements in an 850-nm VCSEL-based link using a two-tap pre-emphasis electronic filter,” J. Lightw. Technol., vol. 35, no. 9, pp. 1633-1639, Dec. 2016.
[6] K. Szczerba, T. Lengyel, M. Karlsson, P. Andrekson, and A. Larsson, “94 Gb/s 4-PAM using an 850-nm VCSEL, pre-emphasis, and receiver equalization,” IEEE Photon. Technol. Lett., vol. 28, no. 22, pp. 2519-2521, Nov. 2016.
[7] J. Lavrencik, V. Thomas, S. Varughese, and S. Ralph, “DSP-enabled 100 Gb/s PAM-4 VCSEL MMF links,” J. Lightw. Technol., vol. 35, no. 15, pp. 3189-3196, Aug. 2017.
[8] Y. Gao, F. Zhang, L. Dou, Z. Chen, and A. Xu, “Intra-channel nonlinearities mitigation in pseudo-linear coherent QPSK transmission systems via nonlinear electrical equalizer,” Opt. Commun., vol. 282, no. 12, pp. 2421-2425, Jun. 2009.