A robust machine learning approach for DC bias prediction in DCO-OFDM based Li-Fi systems

Marwah Salman; David Siddle; Yuan Gao; Khalid Taher Mohammed Al-Hussaini; Miguel A. Fernández; Miguel A. Fernández; Miguel A. Fernández

PMC · DOI:10.1371/journal.pone.0336234·November 10, 2025

A robust machine learning approach for DC bias prediction in DCO-OFDM based Li-Fi systems

Marwah Salman, David Siddle, Yuan Gao, Khalid Taher Mohammed Al-Hussaini, Miguel A. Fernández, Miguel A. Fernández, Miguel A. Fernández

PDF

Open Access

TL;DR

This paper improves DC bias prediction in Li-Fi systems using machine learning, achieving high accuracy and robustness with the Random Forest algorithm.

Contribution

A robust ML model selection process using LazyPredict and Random Forest for accurate DC bias prediction in Li-Fi systems.

Findings

01

Random Forest achieved an R-squared of 0.953 and RMSE of 0.233 for DC bias prediction.

02

Hyperparameter tuning and bootstrap sampling improved model stability and performance.

03

The proposed model outperformed previous approaches in accuracy and robustness.

Abstract

The direct current (DC) in optical orthogonal frequency division multiplexing (DCO-OFDM) scheme is commonly adopted in light fidelity (Li-Fi) technology as it offers a spectrally efficient solution. A prior study adopted a machine learning (ML)-based solution to predict the optimum DC bias using key parameters, including the statistical properties of the OFDM transmitted signal and a polynomial regression model. However, the model’s robustness decreased when the data structure was shuffled, indicating limited generalization to unseen data. This study builds upon that work by utilizing the same dataset and improving the prediction model with advanced ML tools, such as the LazyPredict algorithm (LPA), to systematically evaluate and select a regression model. A robust ML regressor selection process is proposed to ensure the reliability of predictions. Additionally, a comprehensive data…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Chemicals3

Li LED ACO

Diseases6

IFFT DC BDC DCO OFDM ML

Figures7

Click any figure to enlarge with its caption.

Fig 1 — Block diagram of DCO-OFDM transmission scheme with QAM modulation.

Fig 2 — The effect of DC bias on clipping noise variance.

Fig 3 — Overview of the proposed methodology for the ML regressor selection process and model evaluation.

Fig 4 — Friedman statistic and p-value across iterations.

Fig 5 — Bar chart of the mean R2 for the best performing regressors.

Fig 6 — Bar chart of the mean RMSE for the best performing regressors.

Fig 7 — Performance comparison of RF model with polynomial regression (with and without data shuffling).

Equations12

Funding2

—http://dx.doi.org/10.13039/501100009928Higher Committee for Education Development in Iraq
—The University of Leicester

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsOptical Wireless Communication Technologies · Optical Network Technologies · Advanced Photonic Communication Systems

Full text

1 Introduction

Wireless communications play a vital role in a wide range of industrial and everyday applications. The increasing demand for bandwidth has driven the adoption of orthogonal frequency division multiplexing (OFDM), which is a widely used technique to encode digital data across multiple carrier frequencies. This technique is integral to various technologies, including Wi-Fi, 4G long-term evolution (LTE), radio frequency (RF) communication, and optical wireless communications (OWC). Its key advantages include high spectral efficiency, robustness to multipath fading, and scalability [1,2]. To address some expected future issues of Wi-Fi (e.g., limited data transfer rate, lack of security, interference), light fidelity (Li-Fi) has been introduced as a complementary technology to RF communications for indoor and outdoor applications in recent years [3,4]. Compared to traditional RF-based communication systems, Li-Fi offers several benefits, such as high speed, security, and unlicensed bandwidth. Thus, it is becoming a promising technology that complements existing Wi-Fi technologies [5]. OFDM can be used in a different form to achieve the spectral efficiency aspect for the limited bandwidth optical source [i.e., light-emitting diode (LED)]. DC-biased optical OFDM (DCO-OFDM) is a common OFDM variant in optical communications due to its spectrally efficient use compared to other variants such as asymmetrically clipped optical OFDM and Flip OFDM [6].

However, there is a significant challenge hindering the development of DCO-OFDM-based Li-Fi, which is the optimization of DC bias at the transmitter side. If the given DC bias is not sufficient, clipping noise impairments would significantly affect the transmission performance. Conversely, a large DC bias contributes to power inefficiency of the transmission [7]. Therefore, DC bias optimization is formulated as a non-convex optimization problem. Recently, machine learning (ML) has demonstrated broad applicability across diverse domains [8]. In wireless communications, emerging ML-based technologies have offered promising solutions for optimizing optical networks by enabling the network to learn from received signals and optimize its resources [9]. To solve the DC bias optimization problem, an ML-based solution has been explored in the literature to predict the optimum DC bias using the transmitted signal features. For example, reference [10] utilized linear and polynomial regression algorithms to predict the optimal DC bias using a set of signal features and transmission characteristics while satisfying a benchmark bit error rate (BER) constraint. This study showed that the polynomial regression algorithm outperformed the linear algorithm in both R-squared (R^2^) and root mean square error (RMSE), achieving the highest performance with an R^2^ of 96.77% and an RMSE of 0.1925.

However, using only these ML regression algorithms could limit the potential application of advanced ML algorithms in DC bias prediction. A specific example is that when the structure of data samples changes, as occurs in data shuffling, the performance of the polynomial algorithm decreases due to its reliance on the learned structural pattern of the given features. This degradation in performance limits the model generalization and thereby affects the prediction of the optimum DC bias on future unseen data.

In addition, the BER feature obtained from the receiver side was included as an input feature, which is infeasible for real-time prediction undertaking at the transmitter side. A detailed literature review is given in the following section, covering conventional methods and ML-based approaches to investigate the performance of DCO-OFDM in general within the Li-Fi context. However, to the best of our knowledge, there are no reported results on the robustness of ML in DC-bias optimization in DCO-ODFM, with the exception of [10], which utilized linear and polynomial models, and this forms the primary motivation of this work.

This paper improves the optimum DC-bias prediction process presented in [10] to ensure efficient and reliable DCO-OFDM transmission. A robust ML regressor selection process using a LazyPredict algorithm (LPA) is proposed to obtain better prediction performance using the same research dataset. An ensemble learning method such as Random Forest (RF) demonstrated superior performance and improved the robustness and generalization of the ML model, making it more applicable to real-world Li-Fi scenarios. Furthermore, the obtained results were subject to statistical validation, such as the Friedman test. The main contributions of this paper are summarized as follows:

The problem of DC bias optimization is investigated by exploring a robust ML regressor selection process aided by an advanced regression algorithm called LPA.
A comprehensive feature analysis is conducted to evaluate the importance of the features to the optimum DC bias. This process helps in understanding the impact of the relevant features on the model training, ensuring a stable prediction among the transmission cases.
We demonstrate that the ensemble learner model (i.e. RF) outperforms the polynomial regression model used in the prior research, indicating an improvement in the prediction as well as the generalization.
In the validation, a Friedman statistical test is performed to ensure the robustness and reliability of the developed model performance.

The rest of the paper is organized as follows: Sect 2 gives a summary of the DC bias optimization methods used in literature, including both the conventional methods and the ML based approaches. In Sect 3, brief mathematical fundamentals on DCO-OFDM are provided, along with an overview of the ML benefits in Li-Fi applications. The proposed methodology for ML regressor selection is introduced in Sect 4. In Sect 5, the results are presented and discussed. The conclusion is drawn in last Section.

2 Related works

2.1 Conventional methods

In these methods, mitigating clipping noise was approached differently. For example, in [11], an adaptive DCO-OFDM scheme was proposed, where large DC bias values were used to mitigate clipping noise during performance evaluation. However, in this study, transmission performance was assessed for various large DC bias levels, assuming a target BER of 10^−3^. In contrast, reference [12] independently considered the impact of clipping noise on BER performance, separately from the effects of channel noise. The results showed that this type of impairment significantly affects the performance of DCO-OFDM transmission, particularly for high-order multilevel mapping schemes such as quadrature amplitude modulation (QAM). The study proposed a clipping noise mitigation algorithm and noise cancellation procedure at the receiver to improve BER performance. This improvement came with the requirement for several stages of Fourier transform and maximum likelihood detection at the receiving end, which increased the implementation complexity.

To overcome the impact of clipping noise at the transmitter side, an exhaustive search method was proposed in [13] to determine the optimum DC bias for the transmitted signals. However, this optimization process was achieved under a specific optical power constraint. In [14], a companding technique was proposed to compress the negative peaks of the signal to mitigate the clipping noise for those peaks. An inverse companding process was used at the receiver to recover the compressed negative peaks. The results showed that this method achieved better performance over the conventional DCO-OFDM method and significant improvements, particularly for higher modulation orders. In [15], the mean square error (MSE) between the pre-clipped and clipped DCO-OFDM signals was utilized to determine the optimal DC bias and confine the signal within a given range. The performance improvement was mainly achieved for high modulation orders. In [16], a reduction technique for the peak-to-average power ratio (PAPR) was proposed to mitigate the clipping noise. Thereafter, the optimized DC bias for three types of LED was determined. This method applied a random pilot to rotate the phase of the data sequence and prevent the coherent addition of sub-carriers where high peaks occur. Finally, a classical selection algorithm was utilized to determine the lowest PAPR signal for transmission.

In [17], an adaptive DC bias method, named adaptively biased OFDM (ABO-OFDM), was proposed to optimize the DC bias dynamically based on the negative peaks of the signal. The ABO-OFDM method required parts of the bandwidth to accommodate the modified bias which decreased the spectral efficiency of the transmission. In [18], a low-density parity check coding method (LDPC) was proposed to optimize the DC bias according to the LED power constraints and the clipping noise. Optimized DC bias values were determined for 16, 64, and 256 QAM, respectively. Although this method introduced a considerable increase in complexity, the results showed reduced DC bias levels compared to the conventional DCO-OFDM.

2.2 Machine learning methods

ML methods have been explored in the literature to optimize the DC bias in DCO-OFDM systems, aiming to enhance transmission efficiency and mitigate clipping noise distortion. In [10], an ML-based approach was proposed to predict the optimum DC bias using statistical features of the transmitted signal. The study demonstrated superior prediction performance with a polynomial regression compared to a linear model, despite utilizing a relatively small dataset. The optimized DC bias maintained the target BER of the study under an additive white Gaussian noise (AWGN) channel. Beyond optimizing DC bias and mitigating clipping noise at the transmitter, ML-based techniques have also been applied at the receiver. In [19], an artificial neural network (ANN) was employed to mitigate both clipping and channel noise, thus enhancing the overall BER performance. The trained ANN processed the received distorted symbols, reinforcing their correlation with the original transmitted symbols. Experimental results demonstrated a significant BER improvement compared to conventional equalization methods. Similarly, in [20], a deep learning approach using the long short-term memory (LSTM) algorithm was implemented to recover the transmitted symbols at the receiver. The LSTM model achieved performance comparable to the optimal maximum likelihood detection scheme, demonstrating the potential of ML in signal recovery tasks. These findings suggest that ML-driven approaches can substantially enhance Li-Fi transmission. While the integration of ML into optical networks is still evolving, this paradigm shift is increasingly recognized as a promising solution for many optimization issues in network parameters [21].

3 DCO-OFDM based ML in Li-Fi system

3.1 DCO-OFDM fundamentals and clipping noise

A typical block diagram of the DCO-OFDM transmission scheme is shown in Fig 1. The input data is mapped onto M-QAM symbols, where M is the constellation size used for symbol mapping. The inherent nature of QAM symbols is bipolar and complex, represented in the form: $[eqn]$ , where N is the total number of subcarriers. To ensure that the transmitted symbols X adhere to Hermitian symmetry and produce a real-valued time domain signal, they must satisfy the following conditions: $[eqn]$ and $[eqn]$ for 0<l<N/2, where Xl represents the lth data-carrying subcarrier, and Xl and $[eqn]$ are complex conjugates. The Hermitian requirements reduce the actual data-transmitting subcarriers to N/2−1. As a result, the complete input vector to the IFFT has the structure:

[eqn]

Block diagram of DCO-OFDM transmission scheme with QAM modulation.

The sequence of symbols X is then processed using an IFFT to obtain a discrete-time domain signal xm, which is defined as:

[eqn]

where xm signal follows a Gaussian distribution with zero mean and $[eqn]$ variance for large N, and m denotes the sample index [22].

After undergoing processes such as parallel-to-serial conversion, cyclic prefix (CP) insertion, and digital-to-analog conversion (DAC), the time-domain signal xm is shifted by an appropriate DC bias and any remaining negative peaks are then clipped to generate the clipped form of the transmitted signal.

The DC bias, denoted as BDC, must be positive to ensure a unipolar signal, aligning with the requirements of optical transmission. It is defined as:

[eqn]

where

[eqn]

Here, μ represents a positive scaling factor for DC bias adjustment, $[eqn]$ is the standard deviation of the signal, and $[eqn]$ denotes the expectation operator [23]. The average electrical power of the time-domain signal, $[eqn]$ , is proportional to its variance $[eqn]$ . Therefore, the total DC power, $[eqn]$ , can be expressed as

[eqn]

[eqn]

By normalising with respect to $[eqn]$ [24], the expression becomes

[eqn]

Finally, the normalised DC power in decibels is given by

[eqn]

Once the OFDM signals are generated, the optimum DC bias for each signal is determined by iteratively adjusting the scaling factor to ensure the transmission performance remains within a predefined benchmark. A detailed explanation can be found in [10]. The imposed clipping operation to the biased signal causes a clipping noise component that affects the transmission performance. As a result, the DC bias must be accurately calculated to minimize the impact of clipping noise [23,25]. Fig 2 shows the inverse relationship between the clipping noise variance and the DC bias in DCO-OFDM transmission. A larger scaling factor results in a higher DC bias, effectively mitigating clipping noise but at the cost of increased power consumption. Conversely, a smaller scaling factor reduces the DC bias, leading to more pronounced clipping effects, which can introduce nonlinear distortions and degrade the BER performance.

The effect of DC bias on clipping noise variance.

The clipped transmitted signal, $[eqn]$ , which accounts for the clipping noise $[eqn]$ is given by:

[eqn]

After transmission through the optical medium (i.e. LED), where the typical channel incorporates thermal and shot noise as components of AWGN, $[eqn]$ , the received signal y(t) is expressed as:

[eqn]

Where ht represents the channel’s impulse response. At the receiver, the photodetector converts the received optical power into an electrical signal y(t), which contains both the transmitted signal $[eqn]$ and the noise components introduced by the channel [26,27]. It is crucial to distinguish between clipping noise and channel noise, as they originate from different sources and require separate mitigation strategies. Clipping noise, which results from insufficient DC bias, must be minimized at the transmitter to prevent signal distortion before transmission. In contrast, channel noise, introduced during propagation, can be effectively managed through channel estimation and equalization techniques at the receiver [28].

3.2 Machine learning in practical Li-Fi applications

The use of advanced transmission techniques and the implementation of highly flexible principles in optical networks have significantly increased the complexity of their design and operation. This complexity arises from the need to manage various adjustable parameters, such as modulation formats, data rates, and adaptive channel bandwidths. In such flexible use cases, accurately modeling the system with closed-form formulas is often challenging. Recently, ML-based approaches provide the potential to mitigate the nonlinear effects in these networks, which can effectively capture complex behaviors by training on historical network data. The application of ML in physical layer scenarios is primarily driven by the presence of nonlinear effects in optical networks, which render analytical solutions either inaccurate or excessively complex. These nonlinearities can significantly degrade the performance of optical communication systems. However, the practical deployment of ML-based solutions remains challenging due to computational limitations at end-to-end communication terminals, particularly for models that require large volumes of training data [21,29,30].

In the Li-Fi system, the need to provide high-speed connectivity in future applications has opened the door for ML to build intelligent and efficient solutions to overcome many challenges that limit the development phase. So far, many studies have shown that ML plays a crucial role in solving problems such as channel estimation, system optimization, data detection, and decoding [31,32]. In this context, this study aims to develop a robust ML model capable of learning the characteristics of the transmitted signal in a DCO-OFDM system, aiming to enhance system performance by minimizing clipping noise and improving transmission efficiency simultaneously.

4 Methodology

In this section, we explain the methods employed to address the research problem. Each subsection provides detailed insights into the specific aspects of our approach.

4.1 Data description

In this study, we employed the dataset generated and described by Purnita et al. [10], which was specifically designed for machine learning applications aimed at determining the optimum DC bias in DCO-OFDM systems. The dataset was created using a MATLAB simulation model and consists of 250 samples of DCO-OFDM signals generated under diverse transmission conditions. Each sample captures a combination of system parameters and statistical features that are directly relevant to the DC bias optimization problem. The dataset includes the following key parameters: constellation size of QAM modulation (M), number of subcarriers (N), mean, minimum, maximum, standard deviation of the transmitted signal, optimum DC bias, and the resulting BER. To illustrate the distribution of samples across different (N, M) combinations, Table 1 summarizes the number of samples available for each case. As shown, the dataset is not evenly distributed, with certain (N, M) pairs being more heavily represented than others. The values of N are 256, 512, and 1024, which are commonly adopted in OFDM systems. These values are desirable because they ensure the transmitted signal approaches a Gaussian random variable with approximately 95.6% of signal amplitudes falling within twice the standard deviation of the mean [11].

Table 1: Distribution of samples across subcarriers (N) and modulation orders (M).

The modulation order M takes values of 4, 16, 64, 256, and 1024. This wide range represents realistic transmission scenarios by spanning different data rates, with higher orders providing greater throughput but less robustness to clipping noise. The statistical features extracted from each waveform provide a compact representation of the signal and directly inform the DC bias adjustment. Specifically, the minimum and maximum values define the signal’s amplitude range, which determines the margin required to avoid clipping distortion. The mean indicates the average offset of the signal, and the standard deviation characterizes the signal’s power distribution and variability, which influences the scaling factor that primarily controls the required bias. The dataset also records the corresponding BER, serving as a performance benchmark. Together, these features enable the prediction of the optimum DC bias necessary to maintain transmission reliability.

The simulation model used to generate the dataset considered only an AWGN channel. This choice is justified because AWGN is the dominant noise source in indoor optical wireless communication environments, and the primary objective of DC bias optimization is to mitigate clipping noise at the transmitter before signal propagation. The BER benchmark was therefore selected to ensure clipping noise remained confined to an acceptable level. However, following Table 2 in Purnita et al. [10], which presents feature-importance scores for this dataset, features with minimal contribution to the DC bias prediction were excluded from our analysis.

Table 2: ML regression methods and corresponding regressors used in LPA.

4.2 Machine learning model selection framework

The proposed methodology for developing a robust machine learning (ML) regression model to predict the optimized DC bias in DCO-OFDM systems is illustrated in Fig 3. This process consists of multiple stages, including data pre-processing, model training, performance evaluation, performance validation, and optimal regressor selection, ensuring a systematic and data-driven approach to optimizing DC bias. To build an accurate and reliable ML regression model, a pre-processing stage was conducted to refine the dataset. This step ensures that the data is clean, well-structured, and suitable for model training. Specifically, MinMaxScaler from the scikit-learn Python library was applied to normalize the dataset features, rescaling values between 0 and 1. This normalization enhances model performance by reducing the impact of varying feature magnitudes. Following the normalization phase, the dataset was divided into two subsets: 70% of the data was used for training, and the remaining 30% was used for testing the model. This 7:3 training-to-test splitting ratio ensures that a significant portion of the data is designated for the training phase while reserving a sufficient amount of data for evaluating the model’s performance.

Overview of the proposed methodology for the ML regressor selection process and model evaluation.

Selecting the most effective machine learning regression model for a specific problem application remains a significant challenge for researchers, where several factors can influence a model’s performance such as dataset characteristics and model behavior. Therefore, a comprehensive analysis is crucial to evaluate the model’s capability and effectiveness [33]. In this study, a comprehensive evaluation of multiple ML regressors was performed using the LPA to identify the most effective ML regression model. LPA is an advanced Python library designed to automate the comparison of various ML models. The implementation of the LPA library is not yet widely explored but it enables the evaluation of a pool of different regressors [34,35]. This approach is particularly useful for initial regression model selection tasks because it allows us to compare the performance of 41 regressors across various metrics [36]. Table 2 summarizes the learning methods and their corresponding ML regressor in LPA. To mitigate potential biases, the LPA model was trained iteratively five times, with the dataset being randomly reshuffled before each iteration. This iterative process prevented the model from learning any underlying patterns specific to the particular data collection during the simulation. At each iteration, model performance was evaluated using key metrics consistent with those in [10], and the mean performance across all five iterations was calculated to determine overall effectiveness. To validate the statistical significance of the results and confirm that performance differences were not due to chance, a Friedman statistical test was applied across all iterations. Finally, the most effective ML regression model was selected based on its predictive accuracy and statistical validation, ensuring a reliable and data-driven approach to optimizing DC bias in DCO-OFDM systems.

4.3 Machine learning model evaluation

To evaluate the performance of each regression model, the coefficient of determination (R^2^ or R-squared) and RMSE are used primarily to facilitate comparison with related research. The R-squared measures the proportion of variance in the target variable, giving the best prediction at a value close to or equal to 1. Meanwhile, RMSE measures the root mean of the squared differences between the actual and predicted values, reaching 0 for the most accurate prediction. The mathematical representation of R^2^ and RMSE are shown below:

[eqn]

[eqn]

where Xi and Yi are the ith predicted and actual values, respectively, while $[eqn]$ represents the mean of the actual values.

The Friedman statistical test was conducted in this analysis under the assumption of a non-parametric data distribution, making it particularly suitable for small datasets such as the one used in this study. This test compares the differences in performance scores among LPA regressors over multiple iterations. This procedure ensures that the evaluation of our model is not influenced by random variability. If a p-value of less than 0.05 is achieved during the analysis, this indicates that the observed differences in model performance are statistically significant [37].

The Friedman test is widely regarded as a robust method for evaluating multiple models or algorithms in regression and classification studies, particularly when assumptions of normality or homogeneity of variance are not satisfied [38]. When these assumptions aren’t met, the Friedman test is used instead of the traditional statistical tests such as analysis of variance (ANOVA) because it does not rely on strict assumptions [39]. These considerations reinforce the reliability and generalizability of the findings, even with the relatively small dataset. The performance scores were obtained from 5 iterations of running the model while ensuring data reshuffling with each iteration. The mean values of these scores were utilized to identify the most suitable ML regressor for the dataset.

4.4 Final regression model improvement

After selecting the ML regressor, a hyperparameter tuning procedure was conducted using a grid search method to optimize the parameters of the selected ML model and improve the prediction performance. In this method, an exhaustive search is used where a pre-defined set of hyperparameters is specified in advance, and the ML model is trained and evaluated for each combination of these parameters. Considering the relatively small dataset size employed in this study, a data augmentation method, known as Bootstrap sampling, was investigated to increase the dataset’s size and evaluate the model performance with larger training samples.

5 Results and discussion

In this section, we present the key findings of our research and explicit comparison with the previous research outcomes.

5.1 Model training and regression model selection

By leveraging our proposed methodology, the model performance was evaluated based on the R^2^ and RMSE metrics, as shown in Table 3 and Table 4, respectively. Notably, the best performing regressors identified in these tables consistently achieved R^2^ in the range of 0.8 to 0.9, demonstrating the robustness of our approach across these regression models. To address robustness and generalization concerns regarding the results of the proposed model, particularly given the relatively small size of the dataset, a rigorous statistical evaluation was conducted using the Friedman test. This evaluation compares the performance of the regressors across five iterations of the model training process. Fig 4 presents the Friedman statistics (right axis) alongside the corresponding p-values (left axis) for each iteration. From the second iteration, the Friedman test yields p-values of 0.01, indicating statistically significant performance differences among the regressors. To complement the p-values, the effect sizes, expressed as epsilon-squared ( $[eqn]$ ), ranged from 0.90 to 1.00, reflecting a very strong proportion of variance explained by these differences. Taken together, these results demonstrate that the regressors differ in a statistically significant manner, and that the observed differences are also substantial and practically meaningful.

Friedman statistic and p-value across iterations.

Table 3: Performance evaluation of ML regression models using R2.

Table 4: Performance evaluation of ML regression models using RMSE.

These findings provide strong evidence of the model’s robustness, demonstrating that its performance scores are unlikely to have resulted from random variability [40]. To evaluate the findings, only the mean values of the performance metrics were considered to select the ML regression model. The RF regression model is identified as the best performing model, achieving R^2^ of 0.95384 and RMSE of 0.2339. This performance demonstrates the model’s robustness in predicting the optimized DC bias over the remaining regressors, as visualized in Figs 5 and 6. Given this substantial performance, the model is well suited for practical integration into Li-Fi systems. Its implementation can be effectively aligned with a block-based approach for adaptive transmission, where the DC bias is optimized for each individual OFDM block. This strategy reduces the frequency of bias updates, improving computational efficiency while maintaining reliable transmission performance [36]. Recent advances, such as early stopping mechanisms [41,42], demonstrate strategies for reducing energy consumption in microcontrollers, which is particularly relevant for future implementations of ML-based Li-Fi systems on embedded hardware. Similarly, data-parallel RF approaches [43,44] highlight opportunities to accelerate training times. While training is performed offline in the current work, such methods could facilitate periodic model updates in dynamic environments, further supporting the long-term viability of ML-driven parameter optimization in Li-Fi networks.

Bar chart of the mean R2 for the best performing regressors.

Bar chart of the mean RMSE for the best performing regressors.

5.2 Computational complexity

In our study, all simulations and model training were performed on Apple MacBook M2 Pro chip (10-core CPU, 16-core GPU, 32 GB RAM). It is important to emphasize that the primary computational burden lies in the offline training phase; once the model is deployed, only inference is required in real-time operation. The selected RF regressor is computationally lightweight at the inference stage, as it only requires the system to provide the key parameters for predicting the optimum DC bias. When the model was tested for a block-based approach, the average inference time was approximately 0.01 ms per OFDM block of 100 symbols. This latency is negligible compared to the symbol duration in Li-Fi systems, particularly since the update occurs at the block level rather than per individual symbol. Moreover, inference is inherently parallelizable, enabling efficient deployment on embedded processors or FPGA-based systems. These results indicate that the proposed model achieves a favourable balance between accuracy and efficiency, supporting its feasibility for real-time integration in Li-Fi networks.

5.3 Model improvement based on feature selection and hyperparameter tuning

Given the substantial performance of the RF regression model, it is well-suited for future deployment in practical Li-Fi systems. These performance metrics were obtained using the same set of features described in Table 2 of [10] to ensure a fair comparison. However, to further enhance the practicality and interpretability of our model, a feature selection technique was applied to identify the most relevant features for predicting the DC bias. As a result, the Mean and BER features are eliminated due to their weak direct and indirect correlations with the DC bias. The mean value exhibits a weak correlation primarily because the OFDM signal inherently follows a Gaussian distribution with a zero mean. In such a distribution, the overall average of OFDM signal amplitudes is symmetrically distributed around zero regardless of the DC bias level. Consequently, variations in the DC bias do not significantly alter the mean value of the OFDM signal, leading to a weak or negligible correlation between them. Meanwhile, the BER is excluded because the DC bias primarily affects the transmitted signal on the transmitter side of the system, while BER is determined on the receiver side after performing the demodulation process. As a result, including BER in the prediction process would not be practical or meaningful in this context. This procedure of feature refinement is particularly valuable for understanding the impact of each feature on DC bias, which is critical in practical applications. Following this feature selection process, the RF model achieved R^2^= 0.9436 and RMSE= 0.2635. The results show a slight decrease in performance compared to the initial features. However, training the model with the most relevant and correlated features improves its generalization and practicability in deployment.

After evaluating the RF model using the most relevant features, a hyperparameter tuning process was employed using the grid search method to find the optimal configuration that maximizes the model’s predictive accuracy while ensuring strong generalizability to unseen data. The hyperparameters, tuned values, range of search parameters, and the performance scores are shown in Table 5.

Table 5: Results of hyperparameter tuning for RF regressor.

This chosen range of parameters was selected around the bounds of the defaults and typical values known to perform well, while ensuring relevance and computational efficiency. The remaining hyperparameters not included in this table were retained at their default values to maintain practical training without excessive computational cost, as the RF model is known to achieve reliable accuracy with default settings in many applications [45,46]. To ensure that the experimental setup can be fully replicated, a complete table in Appendix A lists all RF hyperparameters, including both the reported tuned values and those retained at their default.

After tuning the hyperparameters with the selected input features, the model performance improved, with R^2^ increasing to 0.9450 and RMSE decreasing to 0.2603, as shown in Table 6. Although this represents only a modest improvement, it constitutes an important and necessary step. The DC bias is strongly correlated with the BER feature, which accounts for the slightly higher performance when BER is included. As mentioned earlier in this section, in practical transmitter-side DC bias optimization, incorporating BER as a predictor is not feasible. Excluding it, therefore, ensures that the model delivers strong and reliable performance suitable for real-world deployment.

Table 6: Performance of RF regression model.

5.4 Superiority of RF model over polynomial regression

In reference [10], data samples were analyzed using linear and polynomial regression models to map the features of the OFDM signal to the optimized DC bias. The reported performance metrics were R^2^ = 0.8406, RMSE = 0.4279 for linear regression, and R^2^ = 0.9677, RMSE = 0.1925 for polynomial regression, suggesting that the polynomial model is the best-performing approach for DC bias prediction in DCO-OFDM systems. To further evaluate the robustness of the polynomial model, we re-assessed its performance on same data but shuffled samples. This resulted in a notable decline in performance, with R^2^ dropping to 0.8922 and RMSE increasing to 0.3595. In contrast, the RF model developed in this study achieved a consistent R^2^ = 0.9450 and RMSE = 0.2603, demonstrating superior accuracy and performance compared to the polynomial. Table 7 summarizes the performance improvement in the RF model relative to the results reported in [10].

Table 7: Performance comparison between Ref [10] and the proposed RF model.

The observed performance reduction in the polynomial model appears to rely on the sequential pattern present in the original dataset. This scenario is unlikely to occur in practical transmission where the system parameters dynamically vary in response to changing channel conditions rather than following a predetermined order. The RF model, however, maintained robust performance on both the original and shuffled datasets. As shown in Fig 7, the proposed RF-based approach provides a meaningful improvement and incremental contribution over prior polynomial model. Thus, highlighting its reliability and suitability for deployment in real-world DCO-OFDM systems.

Performance comparison of RF model with polynomial regression (with and without data shuffling).

5.5 Model improvement based on increasing data samples

The dataset used in this study consists of 250 OFDM samples, which is a legitimate number for experimental work in wireless communication. Increasing the number of training samples was investigated using data augmentation methods to enhance the performance and robustness of the ML model. In methods such as generative adversarial networks, highly realistic data samples that closely resemble the training data can be generated. However, this method is particularly effective for augmenting image data [47]. Similarly, the synthetic minority over-sampling technique can generate new synthetic data by interpolating between the existing training samples.

In the context of this study, the OFDM signal is different in nature from typical data used in these augmentation techniques. It has unique temporal and spectral properties where the orthogonality of sub-carriers is critical to prevent interference. Additionally, the phase and amplitude of the signal are sensitive to noise due to the FFT modulation/ demodulation process. Therefore, applying these standard data augmentation methods to OFDM signals is inadvisable because it can distort the real representation of the properties of the transmitted signal. In this research dataset, the transmission characteristics, denoted by N and M, shape the statistical properties of the transmitted signal and thereby the required DC bias for the corresponding signal. The number of samples for each M and N was not evenly distributed during the simulation, giving insufficient representation to each corresponding signal transmission case.

Bootstrap sampling is a powerful technique used to address the challenges posed by small datasets. In this technique, new data samples are generated by repeatedly drawing samples from the original dataset with replacements. By applying this technique to increase the dataset size, a significant improvement in the RF model performance was achieved, as shown in Table 8. The performance of R^2^ and RMSE improved when the data was doubled, achieving 0.9776 and 0.1626, respectively. Further improvement was observed when the dataset size was tripled, with R^2^ reaching 0.9938 and RMSE decreasing to 0.0851. This significant improvement resulted from the redundancy of the new samples involved in the testing set. This technique provides a significant improvement in balancing the uneven distribution of data and can be applied more effectively in classification problems than in regression to create class balance.

Table 8: Performance of RF regressor with different data size.

5.6 Limitations

This study acknowledges certain limitations that also present future opportunities for research. Although the dataset used introduces a variety of transmission characteristics and captures key statistical features relevant to DC bias optimization, one limitation is its relatively small size of 250 OFDM symbols. The dataset is also not evenly distributed, with certain (N, M) pairs being more heavily represented than others. While sufficient for a proof-of-concept demonstration, a larger dataset would further enhance the robustness and generalizability of the results. Future work could focus on incorporating a larger number of training samples while ensuring an even distribution of key parameters. Additionally, the training process could be supported by employing statistical measures such as the Kappa index to validate the adequacy of the chosen dataset size in relation to the model’s performance [48]. Addressing these aspects would further enhance the scalability, reliability, and broader applicability of the proposed methodology.

6 Conclusion

In this paper, a robust ML selection process was developed to predict the optimized DC bias in DCO-OFDM transmission schemes. The optimal DC bias was determined using the transmitted OFDM signal features to mitigate the impact of clipping noise, thereby improving the overall transmission performance in Li-Fi systems. This study employed a robust ML regression algorithm, which incorporates LPA, to evaluate the performance of various ML regression models. The process of model training was iterated to statistically validate the results. The simulation results demonstrated that the RF regression model outperformed the findings of the previous study. The RF model exhibited strong performance through a comprehensive training and evaluation process, showing its ability to generalize and accurately predict the DC bias in practical applications. Further model improvement, such as increasing the dataset size and hyperparameter tuning, was conducted to ensure the model’s robustness and stability. The developed model is a highly suitable candidate for real-world deployment in DCO-OFDM systems, offering a substantial improvement over conventional optimization methods.

Supporting information

S1 AppendixTable 9. Complete list of RF hyperparameters.(PDF)

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Elsayed EE. Atmospheric turbulence mitigation of MIMO-RF/FSO DWDM communication systems using advanced diversity multiplexing with hybrid N-SM/OMI M-ary spatial pulse-position modulation schemes. Optics Communications. 2024;562:130558. doi: 10.1016/j.optcom.2024.130558 · doi ↗
2Xu Z, Petrunin I, Tsourdos A. Identification of communication signals using learning approaches for cognitive radio applications. IEEE Access. 2020;8:128930–41. doi: 10.1109/access.2020.3009181 · doi ↗
3Haas H, Yin L, Wang Y, Chen C. What is lifi?. Journal of Lightwave Technology. 2015;34(6):1533–44.
4Hammadi AA, Bariah L, Muhaidat S, Al-Qutayri M, Sofotasios PC, Debbah M. Deep Q-learning-based resource management in IRS-assisted VLC systems. Trans Mach Learn Comm Netw. 2024;2:34–48. doi: 10.1109/tmlcn.2023.3328501 · doi ↗
5Hesham H, Ismail T. Hybrid NOMA-based ACO-FBMC/OQAM for next-generation indoor optical wireless communications using Li Fi technology. Opt Quant Electron. 2022;54(3). doi: 10.1007/s 11082-022-03559-1 · doi ↗
6Gawande P, Sharma A, Kushwaha P. Various modulation techniques for Li Fi. International Journal of Advanced Research in Computer and Communication Engineering. 2016;5(3):121–5.
7Bechadergue B, Azoulay B. An industrial view on Li Fi challenges and future. In: 2020 12th International Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP). 2020. p. 1–6. 10.1109/csndsp 49049.2020.9249584 · doi ↗
8Ma J, Lei D, Ren Z, Tan C, Xia D, Guo H. Automated machine learning-based landslide susceptibility mapping for the three gorges Reservoir Area, China. Math Geosci. 2023;56(5):975–1010. doi: 10.1007/s 11004-023-10116-3 · doi ↗