Accurate forecasting of photovoltaic optimal points and efficiency using advanced hybrid machine learning models

Anjan Kumar; Md Asif; Malak Naji; B. Spoorthi; Badri Narayan Sahu; S. Radhika; Marwea Al-hedrewy; Egambergan Khudaynazarov; Hayitov Abdulla Nurmatovich

PMC · DOI:10.1038/s41598-026-39031-3·February 10, 2026

Open Access

TL;DR

This paper introduces a hybrid machine learning model to accurately forecast solar panel performance metrics, improving energy management and decision-making.

Contribution

A novel hybrid XGBA framework is proposed for predicting photovoltaic performance with high accuracy and robustness.

Findings

01

The hybrid XGBA model achieved R² values of 0.9954 for NOPT and 0.9970 for PCE.

02

Key parameters like Emin, Emax, and Ap were identified as significantly influencing model predictions.

Abstract

Accurate forecasting of photovoltaic performance is essential for improving solar energy management, optimizing operational schedules, and supporting investment decisions. This study proposes a structured data-driven forecasting framework that integrates standalone learners with a hybrid boosting–aggregation strategy to predict two critical photovoltaic performance indicators: the optimal peak operating time (NOPT) and the power conversion efficiency (PCE). The methodology involves systematic data preprocessing, feature normalization, model training using both single and hybrid learners, and performance validation under identical experimental conditions. Multiple data-driven algorithms were examined using comprehensive statistical metrics, including R², RMSE, and U95. Among all models, the hybrid XGBA framework demonstrated superior predictive performance, achieving R² values of 0.9954…

Figures14

Line plot of prediction errors for the selected models.

Table [6](#Tab6) presents the results of Dunn’s post hoc test for pairwise model comparisons alongside the Durbin–Watson (DW) statistics to assess the reliability and independence of model residuals. Dunn’s post hoc test is a non-parametric method used for multiple pairwise comparisons following a Kruskal–Wallis test and was selected because the performance metrics, such as RMSE and R², do not necessarily follow a normal distribution. This test evaluates whether the differences in model performance are statistically significant. The Durbin

Taylor diagram for the difference between measured and predicted values.

Figures [12](#Fig12) and [13](#Fig13) present a combined sensitivity analysis assessing the impact of the input variables on the output variables NOPT and PCE, respectively. As per Fig. [12](#Fig12), the FAST sensitivity analysis identifies as the variable with the most significant influence on NOPT, exhibiting an 1 value of 1.45, while is the leading variable for PCE predictions with an 1 of 2.2. This indicates that the extreme values of generated electrical energy strongly govern both the optimal timing of peak operation

FAST Sensitivity analysis depicting the effect of input variables on the model output.

Sensitivity analysis for the impact of input variables on the model’s output based on the ALE method.

Scatter matrix plot for the distribution and relationships within the dataset across different feature subsets.

Performance evaluation of the developed models using key statistical metrics. The best-performing models were selected based on their superior accuracy and reliability.

Figure [8](#Fig8) compares the scatter plots for NOPT and PCE predictions, which depict the degree to which the models’ predictions are accurate. The points representing the hybrid XGBA model are very close to the best-fit line. They are mostly located within the ± 15% deviation lines, showing that there is a good correlation between the predicted and actual values. In NOPT, this means that the model can accurately predict the o

Scatter plot of predicted versus actual values on the test dataset, showing the performance of the selected models.

Table [5](#Tab6) provides an overview of the statistical comparisons of the best hybrid models for both NOPT and PCE targets in the testing phase. The values of NOPT that were measured vary from 0.1 (min) to 10.12 (max), with 4.3182, 3.8950, and 2.8322 being the mean, median, and standard deviation, respectively. The XGBA model is closest to these statistics, indicating that it not only captures the most frequent but also the extreme variations in the number of optimal peak operat

Histogram showing the distribution of prediction errors for the selected models.

Tables1

Table 7. Uncertainty quantification of model performance using confidence intervals.

Process	Target	Model	RMSE CI lower	RMSE CI upper	MDAPE CI lower	MDAPE CI upper
Train	NOPT	RBF	3.9734	5.2215	4.5753	4.6195
		RF	4.0739	5.0396	4.5382	4.5752
		XGB	4.1232	4.9118	4.5035	4.5316
		RBBA	4.2329	5.0019	4.6029	4.6319
		RFBA	4.3028	4.9795	4.6291	4.6532
		XGBA	4.3175	4.8102	4.5535	4.5742
	PCE	RBF	1.4234	1.7572	1.5059	1.6748
		RF	1.4873	1.7810	1.5669	1.7014
		XGB	1.4629	1.7253	1.5361	1.6521
		RBBA	1.5162	1.7594	1.5642	1.7114
		RFBA	1.5346	1.7246	1.5804	1.6788
		XGBA	1.5543	1.6847	1.5875	1.6515
Validation	NOPT	RBF	3.7062	4.9050	4.2836	4.3275
		RF	3.6164	4.6257	4.0866	4.1555
		XGB	3.5934	4.4655	4.0106	4.0483
		RBBA	3.9877	4.3906	4.1727	4.2055
		RFBA	3.8722	4.5890	4.2140	4.2471
		XGBA	3.9243	4.3500	4.1267	4.1476
	PCE	RBF	1.3674	1.6675	1.4456	1.5892
		RF	1.3708	1.6934	1.4410	1.6231
		XGB	1.4287	1.6941	1.5292	1.5936
		RBBA	1.3662	1.6247	1.4297	1.5611
		RFBA	1.4491	1.6422	1.5020	1.5893
		XGBA	1.4554	1.5875	1.4974	1.5455
Test	NOPT	RBF	3.8146	4.7504	4.2469	4.3181
		RF	3.7458	4.9039	4.2952	4.3546
		XGB	4.0457	4.5562	4.2889	4.3130
		RBBA	3.9771	4.7653	4.3465	4.3960
		RFBA	3.9551	4.5967	4.2630	4.2887
		XGBA	4.1215	4.4942	4.2968	4.3189
	PCE	RBF	1.5863	1.9420	1.6738	1.8545
		RF	1.6372	1.9743	1.7214	1.8901
		XGB	1.6371	1.8756	1.6816	1.8311
		RBBA	1.6577	1.9262	1.6995	1.8843
		RFBA	1.6946	1.8787	1.7422	1.8311
		XGBA	1.7452	1.8766	1.7616	1.8602

Keywords

Power conversion efficiency · Optimal operating point · Photovoltaic systems · Renewable energy forecasting · Solar energy modeling · Hybrid machine learning · Energy science and technology · Engineering · Mathematics and computing

Taxonomy

Topics: Solar Radiation and Photovoltaics · Energy Load and Power Forecasting · Photovoltaic System Optimization Techniques

Full text

Introduction

The global transition to sustainable energy has highlighted photovoltaic (PV) technology as a pivotal solution for reducing greenhouse gas emissions and dependence on fossil fuels^1^. Over the past decades, PV research has focused on enhancing power conversion efficiency (PCE), reducing production costs, and incorporating environmentally friendly materials, such as thin-film polymers and perovskite tandems^2^. The integration of PV systems into diesel-based energy infrastructures, including microgrids, remote power stations, and hybrid vehicles, presents a hybrid solution that can improve fuel efficiency, reduce emissions, and extend engine lifespan^3–5^. Such integration specifically aligns with several Sustainable Development Goals (SDGs), notably SDG 7 (Affordable and Clean Energy) and SDG 13 (Climate Action), by encouraging the use of renewable energy and decreasing the use of diesel^6–8^.

One of the significant breakthroughs in PV technology is organic photovoltaics (OPVs), which have attracted renewed interest in recent years following the successful introduction of non-fullerene acceptors (NFAs) that allow single-junction devices to reach power conversion efficiencies (PCEs) above 18%^9–11^. Nevertheless, NFA-based systems still suffer from efficiency-limiting processes that are not fully understood, and this gap hinders the rational, computer-aided design of new donor–acceptor materials. The quadrupole moment of the acceptor has been reported to be the factor that most strongly influences the interfacial energetics, and high internal quantum efficiencies (IQEs) are generally observed when the ionization energy offset available for exciton dissociation exceeds 0.5 eV^12^. Material modification is exemplified by end-group engineering through fluorination and chlorination, which supports charge transport and makes recombination less likely, thereby improving overall device performance^13–15^.

Related works

Machine learning (ML) methods play a significant role in advancing renewable energy research, especially in forecasting and optimizing photovoltaic (PV) systems^16–19^. Keddouda et al.^20^ developed artificial neural network (ANN) and regression models using meteorological data and operating temperature as inputs, achieving high predictive accuracy with R² values reaching 0.998. Kumari and Toshniwal^21^ proposed XGBF-DNN, an ensemble that integrates extreme gradient boosting forests and deep neural networks through ridge regression, improving both the stability and the accuracy of forecasts across a wide variety of climatic conditions. Such ensemble strategies underscore the viability of hybrid ML frameworks for handling the unpredictability of real-world PV system outputs.

Nonetheless, despite progress in predictive capability, interpretability remains a major obstacle. Explainable AI (XAI) techniques such as SHAP^22^ and LIME^23^ can quantify feature importance and provide local explanations; however, they are still largely untapped in PV research. Chen et al.^24^ pointed out the difficulties related to terminology, cross-task evaluation, and the range of existing interpretability techniques, and therefore called for further research to make model behavior more transparent.

Scott et al.^25^ examined the use of benchmark machine learning algorithms to forecast photovoltaic power generation for building-scale renewable energy systems. Several models, including random forest, neural networks, support vector machines, and linear regression, were compared using operational data from a university campus to evaluate forecasting accuracy across different dataset sizes and prediction horizons. The results showed that random forest achieved the lowest average error, although no single algorithm consistently outperformed the others under all conditions. The study highlighted the importance of dataset characteristics and model usability when selecting forecasting approaches for integration into building management systems.

Bhutta et al.^26^ investigated the use of hybrid machine learning models to improve the prediction accuracy of solar power generation within smart grid systems. Hybrid deep learning architectures, including convolutional–recurrent, convolutional–LSTM, and convolutional–GRU networks, were applied to forecast key solar plant parameters such as power production, plane-of-array irradiance, and performance ratio. The hybrid convolutional–LSTM model achieved the highest predictive accuracy, yielding the lowest RMSE and MAE values across all evaluated variables. The findings indicated that hybrid machine learning approaches were effective in enhancing the efficiency and reliability of solar power generation forecasting in intelligent energy networks.

Ridha et al.^27^ proposed a hybrid photovoltaic power prediction framework integrating singular spectrum analysis, an adaptive beluga whale optimization algorithm, and an improved extreme learning machine. Singular spectrum analysis was applied to preprocess long-term PV time-series data, while the adaptive beluga whale optimization method was used to enhance the exploration–exploitation balance and optimize model hyperparameters. The improved extreme learning machine further refined output weight estimation to enhance prediction accuracy. Comparative evaluations using benchmark functions and real-world PV data demonstrated that the proposed hybrid model outperformed existing optimization and hybrid learning approaches across multiple statistical performance metrics.

Although ML and hybrid models have achieved high accuracy in photovoltaic forecasting, existing studies mainly focus on single performance indicators and accuracy-driven evaluation. The simultaneous prediction of optimal operating time and efficiency, along with uncertainty-aware validation and robustness assessment, remains largely unexplored. Moreover, despite advances in hybrid learning, model interpretability and sensitivity-based physical insight are insufficiently integrated into PV forecasting frameworks. These limitations highlight the need for a unified, transparent, and decision-oriented modeling approach that balances accuracy, reliability, and practical applicability. In addition, the proposed XGBA model addresses the methodological gap in existing hybrid PV forecasting approaches by enabling simultaneous multi-target prediction, improving robustness and uncertainty-aware performance, and integrating sensitivity-based interpretability for enhanced operational insight.

Novelty and objective

The primary aim of this research is to develop a hybrid machine learning system capable of accurately predicting two solar energy parameters: the number of optimal peak operating times (NOPT) and the power conversion efficiency (PCE). Accurate forecasting of these targets can lead to better management of solar energy resources, higher-quality service, and easier financial planning. Standard single models often fail to capture the complex relationships between environmental variables and energy outputs. To address this problem, the paper introduces cooperative models that result from the interaction of multiple learning paradigms. The combination integrates the predictive power of tree-based algorithms with metaheuristic optimization strategies, e.g., simulated annealing or genetic algorithms. The central element of the proposed method is the Bat Algorithm (BAT), which acts as the tuner. BAT is a metaheuristic approach inspired by the echolocation behavior of bats. Its advantages include effective exploration and exploitation of the search space, fast convergence, and high adaptability to the problem. Moreover, its ability to balance global search and local refinement makes it suitable for adjusting the parameters of hybrid machine learning models, which in turn improves predictive performance for both targets, NOPT and PCE.

In addition, the paper systematically applies several sensitivity analysis techniques to explore how the input variables affect the model outputs. The FAST sensitivity method gives a comprehensive interpretation of variable importance, whereas the Accumulated Local Effects (ALE) method measures the effect of each input on the predicted outputs regardless of the underlying model. Post hoc statistical tests such as Dunn’s test are also applied to support the statistical significance of the differences between models and the independence of their residuals. The joint use of multiple sensitivity tools makes the proposed models not only accurate but also interpretable, allowing the identification of critical factors affecting solar energy performance. Figure 1 shows the process of the study.

Fig. 1. Process of the present study.
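To make the tuning role of the Bat Algorithm more concrete, the following is a minimal sketch of a basic bat search loop in plain Python. It is not the authors' implementation: the update rules follow the commonly published form of the algorithm, and the objective `toy_validation_error`, the two hyperparameters, the bounds, the random-walk scale of 1% of each range, and all parameter values (`n_bats`, `alpha`, `gamma`, the pulse-rate ceiling of 0.5) are assumptions chosen only for illustration.

```python
import math
import random


def bat_search(objective, bounds, n_bats=15, n_iter=60,
               f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimise `objective` inside the box `bounds` with a basic bat search."""
    rng = random.Random(seed)

    def clip(point):
        return [min(max(p, lo), hi) for p, (lo, hi) in zip(point, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * len(bounds) for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats   # loudness, decays after each accepted move
    rate = [0.0] * n_bats   # pulse emission rate, grows towards 0.5 (assumed)
    best_i = min(range(n_bats), key=lambda i: fit[i])
    best_pos, best_fit = pos[best_i][:], fit[best_i]
    history = [best_fit]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: a random frequency scales the offset from the best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq
                      for v, x, b in zip(vel[i], pos[i], best_pos)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])])
            # Local refinement: small random walk around the current best.
            if rng.random() > rate[i]:
                cand = clip([b + 0.01 * (hi - lo) * rng.uniform(-1, 1) * mean_loud
                             for b, (lo, hi) in zip(best_pos, bounds)])
            cand_fit = objective(cand)
            # Accept non-worsening moves with a probability tied to loudness.
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = 0.5 * (1.0 - math.exp(-gamma * t))
            if cand_fit < best_fit:
                best_pos, best_fit = cand[:], cand_fit
        history.append(best_fit)
    return best_pos, best_fit, history


def toy_validation_error(params):
    """Hypothetical stand-in for a model's validation error."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + ((depth - 6.0) / 10.0) ** 2


search_bounds = [(0.01, 0.5), (2.0, 12.0)]
best_params, best_error, error_history = bat_search(
    toy_validation_error, search_bounds)
```

In a real tuning run, the toy objective would be replaced by a function that trains a learner with the candidate hyperparameters and returns its validation error; the recorded `error_history` can then be used to inspect convergence.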

Mathematical methods

Radial basis function (RBF)

The RBF network, a member of the artificial neural network (ANN) family, links inputs to outputs without an explicit mathematical formula; instead, it infers the model’s structure and unknown parameters from the data alone^28^. The RBF network consists of three layers: input, hidden, and linear output. As input vectors pass through the hidden layer, they are transformed by radial basis functions whose activation is based on the Gaussian function. According to the literature, the Gaussian basis function ($\mathcal{G}_{j}$) is defined by two essential parameters: width and center^29^. The function is expressed as:

$$\mathcal{G}_{j}(x)=\exp\left(-\frac{\left|x-\gamma_{j}\right|^{2}}{2\omega_{j}^{2}}\right) \tag{1}$$

The width and center of the Gaussian basis function are denoted by $\omega_{j}$ and $\gamma_{j}$, respectively, while $x$ is the input pattern. The output neuron is commonly represented by:

$$y(x)=\sum_{j=1}^{n}U_{j}\,\mathcal{G}_{j}(x)+\mathfrak{B} \tag{2}$$

Here, $U_{j}$ is the weight that connects the $j$-th hidden neuron to the output neuron, $\mathfrak{B}$ is the bias coefficient, and $n$ is the number of hidden neurons. Figure 2 shows a flowchart of how the RBF model works.

Fig. 2. Flowchart of the RBF.
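The forward pass of such a network can be sketched in a few lines of plain Python. The sketch assumes the standard Gaussian basis function with a negative exponent, $\exp(-|x-\gamma_j|^2 / (2\omega_j^2))$, followed by the weighted sum plus bias; the centers, widths, weights, and bias below are made-up values for illustration and are not taken from the study.

```python
import math


def gaussian_basis(x, center, width):
    """G_j(x) = exp(-|x - gamma_j|^2 / (2 * omega_j^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_output(x, centers, widths, weights, bias):
    """y(x) = sum_j U_j * G_j(x) + B for a single linear output neuron."""
    hidden = [gaussian_basis(x, c, w) for c, w in zip(centers, widths)]
    return sum(u * g for u, g in zip(weights, hidden)) + bias


# Hypothetical network with two hidden neurons on 2-D inputs.
centers = [[0.0, 0.0], [1.0, 1.0]]   # gamma_j
widths = [1.0, 0.5]                  # omega_j
weights = [2.0, -1.0]                # U_j
bias = 0.1                           # B

y_at_first_center = rbf_output([0.0, 0.0], centers, widths, weights, bias)
```

At the first center the first basis function equals 1 and the second contributes only a small negative term, which shows how each hidden neuron responds mainly to inputs near its own center.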

eXtreme gradient boosting regression (XGBR)

XGBoost is a supervised learning technique built on gradient-boosted decision trees. The scalable, distributed gradient boosting package XGBoost was chosen because of its effectiveness in model training^30^. This approach employs a binary splitting algorithm that iteratively selects the optimal split at each stage. Model selection is further helped by XGBoost’s resistance to overfitting and outliers, which stems from its tree-based structure. Equation (3) defines the regularized objective of the XGBoost model during the $s$-th training phase. The loss function $\ell\left(x_{p}^{(s)},x_{gt}\right)$ measures the difference between the predicted value $x_{p}^{(s)}$ and the corresponding ground truth $x_{gt}$.

$$\mathcal{L}_{f}^{(s)}=\sum_{i}\ell\left(x_{p}^{(s)},x_{gt}\right)+\sum_{k}\Omega\left(f_{k}\right) \tag{3}$$

The regularizer $\Omega\left(f_{k}\right)=\gamma T+\frac{1}{2}\lambda\left\|\omega\right\|^{2}$ represents the complexity of the $k$-th tree, where $T$ is the number of leaves and $\left\|\omega\right\|^{2}$ is the squared $L_{2}$ norm of all leaf scores. The parameters $\gamma$ and $\lambda$ control how conservative the tree search is. Moreover, Fig. 3 shows the flowchart of the XGBR model.

Fig. 3. Structure of the XGB model.
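As a small worked example of a regularized boosting objective of this kind, the sketch below adds a per-sample loss to a complexity penalty of the form $\gamma T + \frac{1}{2}\lambda\|\omega\|^2$ for each tree. The squared-error loss is an assumption (the text does not name the loss), each tree is represented only by its list of leaf scores, and all numbers are invented for illustration; this is not a call into the XGBoost library.

```python
def tree_complexity(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, with T = number of leaves."""
    n_leaves = len(leaf_scores)
    l2_squared = sum(w * w for w in leaf_scores)
    return gamma * n_leaves + 0.5 * lam * l2_squared


def regularized_objective(y_pred, y_true, trees, gamma=1.0, lam=1.0):
    """Data loss (squared error, assumed) plus the complexity of every tree."""
    data_loss = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    penalty = sum(tree_complexity(leaves, gamma, lam) for leaves in trees)
    return data_loss + penalty


# Hypothetical predictions, targets, and two trees given by their leaf scores.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
trees = [[0.5, -0.5], [1.0, 0.0, -1.0]]

objective = regularized_objective(y_pred, y_true, trees, gamma=0.1, lam=2.0)
# data loss 1.25 + penalties 0.7 and 2.3 -> 4.25
```

Raising `gamma` penalizes trees with many leaves, while raising `lam` shrinks large leaf scores, which is why both act as brakes on overfitting.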

Random forest regression (RFR)

The random forest algorithm is an ensemble approach that builds many decision trees, often hundreds or even thousands, and averages their predictions for regression. Each tree is derived from the Classification and Regression Tree (CART), which was first presented by Breiman et al.^31^. The complexity of the data shapes the learning process that each tree goes through. A decision tree is made up of decision nodes and leaf nodes. Each tree maps an input vector $X=\{x_{1},x_{2},\dots,x_{m}\}$ to a scalar output $Y$ and is learned from a training set of $n$ observations ($R_{n}$), defined in Eq. (4).

$$R_{n}=\left\{\left(X_{1},Y_{1}\right),\,\left(X_{2},Y_{2}\right),\,\dots,\,\left(X_{n},Y_{n}\right)\right\},\quad X\in\mathbb{R}^{m},\ Y\in\mathbb{R}. \tag{4}$$

During training, the algorithm optimizes the split function at each node, splitting the input data until it reaches a terminal leaf or satisfies a stopping condition such as a minimum sample size or maximum depth. This process yields a predictive function $\widehat{H}\left(X,R_{n}\right)$. Random Forest Regression^32^ builds an ensemble of tree-structured base learners $H\left(X,\Theta_{k}\right)$, where each $\Theta_{k}$ denotes a random vector that identifies a bootstrap sample of the training data or a subset of features. Bootstrap sampling draws $n$ observations from $R_{n}$ with replacement, so that every observation has an equal selection probability. The bagging procedure repeats this process across several bootstrap sets, producing a separate prediction tree for each set. The result is a set of $q$ trees $\widehat{H}\left(X,R_{n}^{\Theta_{1}}\right),\dots,\widehat{H}\left(X,R_{n}^{\Theta_{q}}\right)$. In contrast to a single decision tree, the outputs of all trees are averaged to produce the final predicted value, $\widehat{Y}$, which improves accuracy and decreases variance^33^.

$$\widehat{Y}=\frac{1}{q}\sum_{l=1}^{q}\widehat{Y}_{l}=\frac{1}{q}\sum_{l=1}^{q}\widehat{H}\left(X,R_{n}^{\Theta_{l}}\right) \tag{5}$$

The output of the $l$-th tree is denoted by $\widehat{Y}_{l}$, where $l$ takes values between 1 and $q$.

By integrating bagging with ensembles of unpruned decision trees, Random Forest (RF) regression improves model robustness^32,33^. RF is computationally efficient because, unlike other tree-based approaches, it does not require pruning. It is also simple to tune, since only two parameters need to be adjusted: the number of trees ($n_{tree}$) and the number of randomly chosen predictors for every split ($m_{try}$)^34^. In general, adding more trees increases accuracy and stability, but beyond a certain point additional trees no longer reduce the error. A standard value of $n_{tree}=500$ is typically used. Increasing $m_{try}$ strengthens the individual trees but also makes them more correlated with one another^35^. Each of the $n_{tree}$ bootstrap samples created during the RF process contains approximately two-thirds of the original dataset. To ensure diversity among trees, the optimal split at each node is determined using a random subset of $m_{try}$ predictors. For regression tasks, predictions are aggregated through averaging, while the out-of-bag (OOB) samples, which are the data not included in a given bootstrap set, are used for validation to lower the risk of overfitting. Figure 4 illustrates the application of the RF regression framework for prediction.

Fig. 4. Flowchart of the RF model.
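The "approximately two-thirds" share of records in each bootstrap sample follows from sampling with replacement: the chance that a given record is never drawn in $n$ draws is $(1-1/n)^{n}$, which approaches $e^{-1}\approx 0.368$, so about 63% of the records are in-bag and the remainder are out-of-bag. The short Python sketch below checks this empirically. The function name, the sample size, the number of repetitions, and the seed are illustrative assumptions and are not taken from the study.

```python
import random


def in_bag_fraction(n_records, n_trees, seed=0):
    """Average share of distinct records that land in one bootstrap sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        # One bootstrap sample: n_records draws with replacement.
        drawn = {rng.randrange(n_records) for _ in range(n_records)}
        fractions.append(len(drawn) / n_records)
    return sum(fractions) / n_trees


n = 214  # illustrative sample size (assumption)
empirical = in_bag_fraction(n, n_trees=500)
theoretical = 1.0 - (1.0 - 1.0 / n) ** n  # tends to 1 - 1/e as n grows
print(f"in-bag: {empirical:.3f} (theory {theoretical:.3f}), "
      f"out-of-bag: {1.0 - empirical:.3f}")
```

The out-of-bag share of roughly one-third is what makes OOB validation possible without setting aside a separate hold-out set for each tree.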

Overview of the BAT search algorithm

The BAT search algorithm is modeled on the echolocation that wild bats use to find food. It was first presented by Yang^36–39^ and has been applied to a range of optimization problems. Every virtual bat in the initial population updates its position in the same way, using echolocation. Echolocation is a perceptual mechanism in which a bat emits a sequence of loud ultrasonic pulses and listens for the echoes. Bats can identify a particular prey from the delays and the different sound levels with which these pulses return. A few idealized rules are adopted to build the BAT algorithm's structure and take advantage of bats' echolocation traits^40–43^.

(a) Every bat uses echolocation to differentiate between obstacles and prey; (b) Every bat flies at random with velocity $k_{i}$ at position $x_{i}$ with a fixed frequency $f_{min}$, varying wavelength $\lambda$, and loudness $E_{0}$ to find prey; it controls the frequency of its emitted pulses and adjusts the pulse emission rate $r$ in the range [0, 1], depending on how close its target is; (c) Every bat varies its frequency, loudness, and pulse emission rate; (d) The loudness $E_{m}^{iter}$ decreases from a large value $E_{0}$ to a minimum constant value $E_{min}$. Throughout the optimization process, each bat's position $x_{i}$ and velocity $k_{i}$ must be defined and updated; the new solutions $x_{i}^{t}$ and velocities $k_{i}^{t}$ at time step $t$ are obtained from the following Eqs.^44,45^:

$$f_{i}=f_{min}+\left(f_{max}-f_{min}\right)\phi$$

$$k_{i}^{t}=k_{i}^{t-1}+\left(x_{i}^{t}-x^{*}\right)f_{i}$$

$$x_{i}^{t}=x_{i}^{t-1}+k_{i}^{t}$$

Where $\phi$ is a random vector drawn from a uniform distribution on [0, 1], and $x^{*}$ is the current global best location, found by comparing the positions of all $n$ bats. Because the velocity increment is the product of $\lambda_{i}$ and $f_{i}$, one may use either $f_{i}$ or $\lambda_{i}$ to adjust the velocity change while keeping the other fixed. In the implementation, each bat is assigned a random frequency drawn uniformly from [$f_{min}$, $f_{max}$]. For the local search, once one of the current best solutions has been selected, a new solution is generated locally for each bat using a random walk:

$$x_{new}=x_{old}+\epsilon E^{t}$$

Where $E^{t}$ is the average loudness of all bats at this time step and $\epsilon$ is a random value between −1 and 1. The loudness and the pulse emission rate are updated as the iterations proceed: once a bat has located its prey, the loudness typically falls while the rate of pulse emission rises, so the initial loudness may be set to any convenient value. Taking $E_{min}=0$ to mean that a bat has just found its prey and has momentarily stopped emitting sound, one obtains:

$$E_{i}^{t+1}=\beta E_{i}^{t},\quad r_{i}^{t+1}=r_{i}^{0}\left[1-\exp\left(-\gamma t\right)\right]$$

Where $\gamma$ is a positive constant and $\beta$ is a constant in the interval [0, 1]. As time approaches infinity, the loudness tends to zero and $r_{i}^{t}$ tends to $r_{i}^{0}$.
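The update rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the implementation used in the study: the sphere objective, the population size, the iteration count, the frequency range, the values of $\beta$, $\gamma$, and $r_{i}^{0}$, the search bounds, and the seed are all hypothetical choices. The local walk is centred on the current best solution, and a candidate is accepted only if it is no worse than the bat's current solution and a uniform draw is below the bat's loudness; both are common conventions for this algorithm that are assumed here. The velocity update keeps the sign convention of the equations as written.

```python
import math
import random


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def bat_search(objective, dim=2, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
               beta=0.9, gamma=0.9, r0=0.5, bounds=(-5.0, 5.0), seed=0):
    """Minimise `objective`; returns (best position, best value, history)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        return max(lo, min(hi, v))

    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_bats)]
    k = [[0.0] * dim for _ in range(n_bats)]   # velocities k_i
    loud = [1.0] * n_bats                      # loudness E_i, starting at E_0 = 1
    rate = [0.0] * n_bats                      # pulse emission rate r_i
    fit = [objective(p) for p in x]
    i_best = min(range(n_bats), key=lambda i: fit[i])
    best, best_fit = x[i_best][:], fit[i_best]
    history = [best_fit]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats         # average loudness E^t
        for i in range(n_bats):
            # f_i = f_min + (f_max - f_min) * phi, with phi ~ U(0, 1)
            f_i = f_min + (f_max - f_min) * rng.random()
            # Velocity update from the stored position, then position update.
            k[i] = [k[i][d] + (x[i][d] - best[d]) * f_i for d in range(dim)]
            cand = [clip(x[i][d] + k[i][d]) for d in range(dim)]
            if rng.random() > rate[i]:
                # Local random walk: x_new = x_old + eps * E^t, eps ~ U(-1, 1)
                cand = [clip(best[d] + rng.uniform(-1.0, 1.0) * mean_loud)
                        for d in range(dim)]
            cand_fit = objective(cand)
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                x[i], fit[i] = cand, cand_fit
                loud[i] *= beta                              # loudness decays
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))  # pulse rate rises
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
        history.append(best_fit)
    return best, best_fit, history


best, best_fit, history = bat_search(sphere)
print(f"best value {best_fit:.6f} at {best}")
```

In a hyperparameter-tuning setting, `objective` would be replaced by a function that trains a model with the candidate settings and returns a validation error, which is why each fitness evaluation is the dominant cost.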

Evaluation metrics

This section details the statistical metrics used to assess the accuracy of predicting the number of peak times (NOPT) and the power conversion efficiency (PCE) in solar energy systems. The coefficient of determination (R²) measures the agreement between actual and predicted values, where a value close to one indicates a strong match. For instance, if modules with an actual PCE of 18.5% are consistently predicted at about 18.3%, R² will be high, indicating close agreement between the two sets of values. The root mean square error (RMSE) gives the typical magnitude of the differences between predicted and actual values. For example, predicting an NOPT of 49 when the actual value is 50 adds only slightly to the RMSE. The uncertainty at the 95% confidence level (U95) indicates prediction stability and helps to ensure that long-term forecasts are reliable. The mean relative absolute error (MRAE) and the median absolute percentage error (MDAPE) are normalized, robust measures of error. Finally, the prediction interval coverage probability (PICP) checks whether the actual NOPT or PCE values fall within the model’s predicted bounds. The mathematical formulations of the employed evaluation metrics are presented in Eqs. (11) to (16).

$$R^{2}=\left(\frac{\sum_{i=1}^{n}\left(t_{i}-\bar{t}\right)\left(p_{i}-\bar{p}\right)}{\sqrt{\left[\sum_{i=1}^{n}\left(t_{i}-\bar{t}\right)^{2}\right]\left[\sum_{i=1}^{n}\left(p_{i}-\bar{p}\right)^{2}\right]}}\right)^{2}$$

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_{i}-t_{i}\right)^{2}}$$

$$U95=\sqrt{\frac{\sum_{i=1}^{n}\left(p_{i}-\bar{p}\right)^{2}}{n\left(n-1\right)}}$$

$$MRAE=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|t_{i}-p_{i}\right|}{\left|t_{i}-\bar{t}\right|}$$

$$MDAPE=\underset{i=1,\dots,n}{\mathrm{median}}\left(\left|\frac{p_{i}-t_{i}}{p_{i}}\right|\times 100\%\right)$$

$$PICP=\frac{1}{n}\sum_{i=1}^{n}k_{i}\times 100\%,\quad k_{i}=\begin{cases}1, & t_{i}\in\left[low_{i},\,up_{i}\right]\\ 0, & t_{i}\notin\left[low_{i},\,up_{i}\right]\end{cases}$$

Where $t_{i}$ is the observed (actual) solar energy value at instance $i$, $p_{i}$ is the predicted solar energy value at instance $i$, and $\bar{t}$ and $\bar{p}$ are the means of the observed and predicted values, respectively. $n$ denotes the total number of observations. $\left[low_{i},\,up_{i}\right]$ are the lower and upper prediction interval bounds for the $i$th prediction, and $k_{i}$ is the indicator variable, equal to 1 if the observed value lies within the prediction interval and zero otherwise.
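As a hedged illustration, the six metrics can be written as plain Python functions. The function names and the toy values below are assumptions made for this sketch. MDAPE is computed here as the median of the per-sample percentage errors, and PICP counts the observed values that fall inside the interval, following the verbal definitions given above. Note that MRAE as defined is undefined whenever an observed value equals the observed mean.

```python
import math
from statistics import mean, median


def r2(t, p):
    """Squared Pearson correlation between observed t and predicted p."""
    t_bar, p_bar = mean(t), mean(p)
    cov = sum((ti - t_bar) * (pi - p_bar) for ti, pi in zip(t, p))
    ss_t = sum((ti - t_bar) ** 2 for ti in t)
    ss_p = sum((pi - p_bar) ** 2 for pi in p)
    return (cov / math.sqrt(ss_t * ss_p)) ** 2


def rmse(t, p):
    """Root mean square error."""
    return math.sqrt(mean([(pi - ti) ** 2 for ti, pi in zip(t, p)]))


def u95(p):
    """U95 as defined in the text: spread of predictions over n(n - 1)."""
    p_bar, n = mean(p), len(p)
    return math.sqrt(sum((pi - p_bar) ** 2 for pi in p) / (n * (n - 1)))


def mrae(t, p):
    """Mean relative absolute error; fails if some t_i equals mean(t)."""
    t_bar = mean(t)
    return mean([abs(ti - pi) / abs(ti - t_bar) for ti, pi in zip(t, p)])


def mdape(t, p):
    """Median of per-sample absolute percentage errors (in percent)."""
    return median([abs((pi - ti) / pi) * 100.0 for ti, pi in zip(t, p)])


def picp(t, low, up):
    """Share (in percent) of observed values inside [low_i, up_i]."""
    hits = [1 if lo_i <= ti <= up_i else 0
            for ti, lo_i, up_i in zip(t, low, up)]
    return 100.0 * sum(hits) / len(hits)


# Toy values (assumption), not data from the study.
t = [1.0, 2.0, 3.0, 4.0]
p = [1.1, 1.9, 3.2, 3.8]
low = [pi - 0.15 for pi in p]
up = [pi + 0.15 for pi in p]
print(r2(t, p), rmse(t, p), u95(p), mrae(t, p), mdape(t, p), picp(t, low, up))
```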

Rationale for the hybridization strategy

The hybridization strategy adopted in this study is designed to enhance nonlinear pattern learning by combining the complementary strengths of different learning paradigms rather than relying on a single-model structure. Single machine learning models, such as kernel-based learners or tree-based algorithms, are effective in capturing specific types of relationships; however, they are inherently limited in representing the full complexity of photovoltaic system behavior, which is governed by highly nonlinear, nonstationary, and interacting environmental and operational variables. In the proposed hybrid framework, gradient boosting models act as strong base learners capable of capturing high-order nonlinear interactions and abrupt regime changes, while the adaptive aggregation mechanism integrates multiple weak and strong predictors to reduce bias and variance simultaneously.

This fusion enables the model to learn both global trends and localized nonlinear responses, which are common in PV systems due to fluctuating irradiance, temperature-dependent efficiency, and extreme energy generation. Hybridization improves learning performance by mitigating the weaknesses of individual models. While single models may overfit local patterns or underperform in extrapolation regions, the fusion strategy stabilizes predictions through ensemble averaging and adaptive weighting, thereby improving generalization and robustness. This is particularly important for small-to-moderate datasets, where individual learners may exhibit high variance. Furthermore, the hybrid framework enhances error correction, as other models in the ensemble can compensate for mispredictions from a single model. This mechanism explains the observed reductions in RMSE and uncertainty bounds, as well as the consistent performance across the training, validation, and test datasets. Compared to single-model baselines, the hybrid approach demonstrates superior capability in learning complex nonlinear relationships while maintaining interpretability and stability, making it especially suitable for simultaneous forecasting of NOPT and PCE. As a result, the hybridization strategy directly addresses the limitations of standalone models and provides a more reliable and scalable solution for real-world photovoltaic system forecasting.

Code availability

All data preprocessing procedures, hybrid machine learning model implementations (including the XGBA framework), training scripts, and evaluation workflows used in this study were custom-developed and implemented in Python. To ensure transparency and reproducibility, the complete source code, including model configurations, parameter settings, and execution instructions, is available from the corresponding author upon reasonable request. Requests for access can be directed to: [email protected]. The code is provided for academic and research purposes without restriction.

Dataset description

The dataset in this research includes 305 records with seven input variables, namely Ap, Amin, Amax, Ep, Emin, Emax and nyield. The prediction targets are the number of peak times (NOPT) and the power conversion efficiency (PCE), both expressed as percentages. The dataset was obtained from^46^. To ensure reproducibility and prevent data leakage, it was explicitly partitioned into three mutually exclusive subsets: 70% (214 samples) for model training, 15% (46 samples) for validation, and 15% (45 samples) for independent testing. This splitting strategy was selected to provide sufficient samples for learning model parameters while reserving adequate data for unbiased hyperparameter tuning and final performance assessment. The validation subset was used exclusively for model selection and hyperparameter optimization, whereas the test subset remained completely unseen during the training and tuning phases. This strict separation ensures that reported test results reflect true generalization performance rather than memorization effects. Moreover, data splitting was performed in a deterministic, reproducible manner, and the same partitions were consistently applied across all single and hybrid models to ensure fair, transparent comparisons. This structured training–validation–testing workflow minimizes the risk of optimistic bias and aligns with best practices in machine learning–based photovoltaic performance modeling.
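A deterministic partition of this kind can be sketched as follows. The subset sizes (214, 46, and 45 out of 305) come from the text; the function name, the shuffling approach, and the seed value are assumptions, since the actual splitting procedure is not described in detail.

```python
import random


def split_indices(n_records, n_train, n_val, seed=42):
    """Shuffle record indices once with a fixed seed and cut three subsets."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)  # the fixed seed makes the split repeatable
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(305, n_train=214, n_val=46)
print(len(train_idx), len(val_idx), len(test_idx))
```

Reusing the same index lists for every model is what guarantees that single and hybrid learners are compared on identical partitions.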

According to Table 1, the variables characterize various operational and environmental conditions related to solar energy systems. Specifically:

Ap expresses the peak absorption wavelength measured under standard test conditions (nm).
Amin and Amax represent the minimum and maximum absorption wavelength during the measurement period, reflecting daily and seasonal variations in sunlight exposure.
Ep denotes the peak emission wavelength (in nm) produced under the measured irradiance conditions.
Emin and Emax indicate the minimum and maximum emission wavelength across different operating regions, capturing fluctuations due to environmental and system variations.
nyield is the absolute emission quantum yield, ranging from 0 to 100, calculated as the ratio of actual energy output to the available solar resource, reflecting system performance efficiency.

The target variables quantify predictive objectives:

NOPT, with a maximum value of 12.91%, indicates the number of optimal peak operating times for the PV system.
PCE, with a maximum of 4.36%, measures the efficiency with which solar irradiance is converted into electrical energy.

All measurements were collected using calibrated pyranometers for irradiance and standard energy meters for electrical output, ensuring accurate representation of environmental and operational conditions. Hence, this dataset not only reflects environmental variations but also captures system performance metrics, providing a solid foundation for building and validating predictive models. Before training and evaluating the predictive models, the raw dataset underwent a systematic preprocessing workflow to ensure data quality, consistency, and compatibility with machine learning algorithms. First, all input variables were normalized using min–max scaling to map values to the range 0–1, preventing features with larger numerical ranges from dominating the learning process. Second, missing values were handled using a two-step approach: records with minor missing entries (< 5% of the dataset) were imputed using linear interpolation based on neighboring temporal values, while records with substantial missing information were excluded to avoid introducing bias. This ensured that the final dataset retained meaningful variability without compromising integrity. Third, noise filtering was applied to smooth transient fluctuations in energy and irradiance measurements: a moving average filter with a window size of 3 was applied to the input features to reduce measurement noise while preserving significant trends relevant to model training.
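The scaling and smoothing steps can be illustrated with a short Python sketch. The function names and the toy series are assumptions. The handling of the first and last points in the moving average (the window is shortened at the edges) is also an assumption, because the text does not specify it.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant series maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def moving_average(values, window=3):
    """Centred moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy feature column (assumption), not data from the study.
raw = [900.0, 450.0, 0.0, 450.0, 900.0]
scaled = min_max_scale(raw)
smoothed = moving_average(scaled, window=3)
print(scaled)
print(smoothed)
```

To avoid leaking information from the validation and test subsets, the minimum and maximum used for scaling would normally be taken from the training subset only; whether the study did so is not stated.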

Table 1. Overview of input features and output variables with their statistical properties.

| Data role | Variable | Max | Min | Mean | Median | St. Dev |
|---|---|---|---|---|---|---|
| General optical features | Ap | 900 | 1.6 | 459.2567 | 500 | 213.1361 |
| | Amin | 900 | 0.1 | 395.2109 | 475 | 263.8159 |
| | Amax | 900 | 0.00078 | 348.5977 | 450 | 297.1935 |
| | Ep | 900 | 0.00078 | 310.3975 | 225 | 302.228 |
| | Emin | 900 | 0.00078 | 255.4623 | 56 | 301.0468 |
| | Emax | 900 | 0.00078 | 194.6551 | 0.05 | 305.5484 |
| | nyield | 100 | 0.00078 | 19.9539 | 13.94 | 28.45924 |
| Target optical features | NOPT (%) | 12.91 | 0.00078 | 3.052606 | 2.3 | 2.581743 |
| | PCE (%) | 4.36 | 0.00078 | 1.627139 | 1.5 | 0.993607 |

Figure 5 shows a scatter matrix, which displays the distributions and pairwise relationships of the features in the dataset. Each diagonal panel shows the distribution of an individual variable, whereas the off-diagonal plots show possible correlations and groupings between pairs of features. The NOPT values are distributed mainly between 0 and 4, so most samples fall within this range. Likewise, the PCE values are concentrated between 0 and 4.4, consistent with their range in the dataset. Overall, the matrix delineates variable ranges and reveals potential relationships (linear or nonlinear) and regions of concentration, giving a brief indication of feature behavior that is useful for exploratory data analysis.

Fig. 5. Scatter matrix plot for the distribution and relationships within the dataset across different feature subsets.

Analysis of prediction results

The computational complexity and training time of the proposed models were systematically analyzed to assess their practical feasibility and scalability. The runtime results clearly demonstrate the computational trade-off introduced by BA–based optimization across all models and both target variables (NOPT and PCE). In all cases, incorporating BA increased execution time by approximately 3–5 times compared with the corresponding base models, attributable to the iterative population-based search mechanism and the repeated fitness evaluations inherent to metaheuristic optimization techniques. Among the evaluated models, RBF consistently exhibited the lowest computational cost, both in its base configuration and when coupled with BA. For NOPT prediction, the RBF model required 9.47 s in the base form and 48.29 s with BA optimization, while for PCE prediction, the runtime remained similarly low (10.36 s in the base form and 52.87 s with BA). This behavior reflects the simpler mathematical structure and lower training complexity of kernel-based models, making RBF computationally efficient even under optimization. The XGBoost-based models showed moderate computational demand, with base runtimes of 16–18 s, increasing to 62–66 s after BA optimization. The additional overhead primarily stems from repeated tree construction, gradient boosting iterations, and hyperparameter evaluations during the optimization process.

In contrast, Random Forest exhibited the highest computational cost, particularly in its optimized form, with runtimes reaching 80–84 s, due to the large ensemble size, bootstrap sampling, and repeated evaluation of tree-based structures across BA iterations. From a scalability perspective, the observed computational trends indicate that training time grows approximately linearly with dataset size for RBF and near-linearly to moderately superlinearly for tree-based ensemble models. While BA-based hybridization introduces additional overhead, this cost is incurred offline during model development and optimization, whereas online inference remains computationally lightweight, enabling real-time deployment in photovoltaic monitoring systems. Regarding scalability to larger PV datasets and different climate zones, the proposed hybrid framework is inherently extensible. Larger datasets are expected to improve generalization while increasing training time proportionally, particularly for ensemble models. However, the modular design of the hybrid approach allows parallelization of BA fitness evaluations and tree construction, making it suitable for high-performance or cloud-based computing environments. Moreover, the data-driven nature of the models enables adaptation to diverse climatic conditions, provided that representative environmental and operational data from different regions are included during training.
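The overhead factor can be checked directly from the quoted runtimes. The sketch below divides the BA-optimized runtime by the base runtime for the RBF model and brackets the XGBoost ratio from the quoted ranges. The variable names are illustrative, and pairing the range endpoints for XGBoost is an assumption, since per-target values are not given in the text.

```python
# Runtimes in seconds as quoted in the text for the RBF model.
rbf_runtime = {
    "NOPT": {"base": 9.47, "with_ba": 48.29},
    "PCE": {"base": 10.36, "with_ba": 52.87},
}

# Overhead factor = optimized runtime / base runtime.
overhead = {
    target: times["with_ba"] / times["base"]
    for target, times in rbf_runtime.items()
}

# XGBoost: only ranges are quoted (16-18 s base, 62-66 s with BA),
# so the ratio can only be bracketed.
xgb_low = 62 / 18
xgb_high = 66 / 16

for target, factor in overhead.items():
    print(f"RBF {target}: x{factor:.2f}")
print(f"XGBoost: between x{xgb_low:.2f} and x{xgb_high:.2f}")
```

With these figures the RBF ratio is roughly 5.1 for both targets, at the upper edge of the stated 3–5 range, while the XGBoost bracket is about 3.4 to 4.1.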

Fig. 6. 3D wall plot illustrating the convergence behavior of the optimization process across iterations or parameters.

The random search procedure was conducted using predefined hyperparameter ranges that were selected based on model-specific constraints, prior literature, and preliminary sensitivity trials to ensure both computational feasibility and sufficient exploration of the solution space. For kernel-based hybrid models (RBBA), the length scale was sampled from a continuous logarithmic range of [10⁻³, 10¹], while the lower and upper bounds of the length scale were drawn from [10⁻⁵, 10⁻²] and [10³, 10⁶], respectively, allowing the model to capture both smooth and highly nonlinear functional relationships. For hybrid models (RFBA and XGBA), the number of estimators was randomly sampled from the interval [20, 1000], enabling evaluation of ensemble sizes from small to large. The maximum tree depth was explored within the range [5, 1000] to assess the trade-off between model expressiveness and overfitting risk, while the minimum number of samples required to split a node was sampled from [2, 150] to regulate tree granularity and stability. For boosting-based hybrids (XGBA), the learning rate was sampled from the continuous range [0.01, 0.9] to balance convergence speed and generalization performance. In addition, the column sampling rate per tree (colsample_bytree) was varied within [0.5, 1.0] to enhance feature diversity and reduce correlation among trees, and the number of leaves was sampled from [10, 100] to control the complexity of individual boosting trees.
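Drawing candidates from these ranges might look like the following sketch. The ranges are the ones quoted above; the dictionary keys, the helper names, the seed, and the choice to sample the length-scale bounds log-uniformly are assumptions. Assigning `min_samples_split` to RFBA only follows Table 2 and is likewise an assumption about how the search was set up.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose base-10 logarithm is uniform on [log10 low, log10 high]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def sample_config(model, rng):
    """Return one random hyperparameter candidate for the named hybrid model."""
    if model not in ("RBBA", "RFBA", "XGBA"):
        raise ValueError(f"unknown model: {model}")
    if model == "RBBA":
        return {
            "length_scale": log_uniform(rng, 1e-3, 1e1),
            "length_scale_lower": log_uniform(rng, 1e-5, 1e-2),
            "length_scale_upper": log_uniform(rng, 1e3, 1e6),
        }
    config = {
        "n_estimators": rng.randint(20, 1000),  # randint bounds are inclusive
        "max_depth": rng.randint(5, 1000),
    }
    if model == "RFBA":
        config["min_samples_split"] = rng.randint(2, 150)
    else:  # XGBA
        config["learning_rate"] = rng.uniform(0.01, 0.9)
        config["colsample_bytree"] = rng.uniform(0.5, 1.0)
        config["num_leaves"] = rng.randint(10, 100)
    return config


rng = random.Random(0)  # fixed seed (assumption) so the candidates are repeatable
candidates = {m: [sample_config(m, rng) for _ in range(200)]
              for m in ("RBBA", "RFBA", "XGBA")}
print(candidates["XGBA"][0])
```

Each candidate would then be trained on the training subset and scored on the validation subset, with the best-scoring configuration retained.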

From a computational standpoint, the random search was executed for a fixed budget of 200 independent hyperparameter evaluations per model–target pair, ensuring consistent and fair optimization across all frameworks. Each candidate configuration was trained on the training subset and evaluated exclusively on the validation subset using RMSE and R² as the primary selection criteria. Table 2 summarizes the hyperparameters optimized for the hybrid models used to predict the two solar energy targets, NOPT and PCE: length scale, length scale bounds (lower and upper), number of estimators, maximum tree depth, minimum samples required to split a node, learning rate, colsample_bytree (NOPT only), and number of leaves (PCE only). These parameters control each model’s flexibility, complexity, and learning dynamics, and were tuned to maximize prediction accuracy. For example, the RBBA model has a length scale of 3.9516 for NOPT and 2.1531 for PCE, which sets the smoothness of the underlying regression function. In the tree-based hybrid models, RFBA and XGBA use 321 and 246 estimators for NOPT and 846 and 21 for PCE, respectively, showing that the selected ensemble size varies widely between models and targets.

All experiments were conducted under identical computational settings to ensure fair model comparison and reproducibility. The implementations were executed on a workstation equipped with an Intel® Core™ i7 processor, 32 GB RAM, and a 64-bit Windows operating system. The models were implemented in Python (v3.9) with the Scikit-learn, XGBoost, and NumPy libraries, which are widely adopted in ML research.
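The fixed-budget random search can be sketched as a simple loop. This is a minimal illustration rather than the authors' implementation: `random_search` and `toy_validation_rmse` are hypothetical names, the toy scoring function stands in for "train on the training subset, score on the validation subset", and only RMSE is used for selection here, whereas the study also considers R².

```python
import random


def random_search(sample_config, validation_rmse, budget=200, seed=0):
    """Evaluate `budget` random configurations and keep the one with lowest RMSE."""
    rng = random.Random(seed)
    best_config, best_rmse = None, float("inf")
    for _ in range(budget):
        config = sample_config(rng)
        rmse = validation_rmse(config)  # score on the validation subset only
        if rmse < best_rmse:
            best_config, best_rmse = config, rmse
    return best_config, best_rmse


def toy_validation_rmse(config):
    # Hypothetical stand-in for model training and validation scoring:
    # the error is smallest when the learning rate is near 0.3.
    return abs(config["learning_rate"] - 0.3) + 0.05


best, score = random_search(
    lambda rng: {"learning_rate": rng.uniform(0.01, 0.9)},
    toy_validation_rmse,
)
```

Because every model–target pair receives the same budget, differences in the selected configurations reflect the search space and data rather than unequal tuning effort.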

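Table 2. The hyperparameters of the hybrid models, along with their assigned values.

| Target | Hyperparameter | RBBA | RFBA | XGBA |
|---|---|---|---|---|
| NOPT | length_scale | 3.9516 | × | × |
| NOPT | length_scale_bounds (lower) | 5.47E-05 | × | × |
| NOPT | length_scale_bounds (upper) | 321,684 | × | × |
| NOPT | n_estimator | × | 321 | 246 |
| NOPT | max_depth | × | 89177† | † |
| NOPT | min_samples_split | × | 126 | × |
| NOPT | learning_rate | × | × | 0.894651 |
| NOPT | colsample_bytree | × | × | 8 |
| PCE | length_scale | 2.153142 | × | × |
| PCE | length_scale_bounds (lower) | 0.00003 | × | × |
| PCE | length_scale_bounds (upper) | 205,465 | × | × |
| PCE | n_estimator | × | 846 | 21 |
| PCE | max_depth | × | 54856† | † |
| PCE | min_samples_split | × | 22 | × |
| PCE | num_leaves | × | × | 76 |
| PCE | learning_rate | × | × | 0.84651 |

† The RFBA and XGBA max_depth values are run together in the source and are reproduced unsplit.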

To address concerns regarding potential overfitting due to the small dataset size (305 samples), a 5-fold cross-validation procedure was implemented on three representative single models: RBF, RF, and XGB. Table 3 shows the 5-fold cross-validation results for the single models. The 5-fold results demonstrate consistent performance across folds, indicating robust generalization ability.

Table 3. Result of the 5-fold cross-validation performance of single models.

| Model | Target | R² (Mean ± Std) | RMSE (Mean ± Std) |
|---|---|---|---|
| RBF | NOPT | 0.969 ± 0.006 | 0.482 ± 0.052 |
| RBF | PCE | 0.977 ± 0.005 | 0.150 ± 0.008 |
| RF | NOPT | 0.969 ± 0.008 | 0.472 ± 0.057 |
| RF | PCE | 0.978 ± 0.006 | 0.153 ± 0.010 |
| XGB | NOPT | 0.986 ± 0.004 | 0.351 ± 0.021 |
| XGB | PCE | 0.981 ± 0.004 | 0.135 ± 0.007 |
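The 5-fold procedure behind Table 3 can be illustrated with a short sketch. The helper names are hypothetical, the shuffling seed is arbitrary, and the use of the sample standard deviation for the "± Std" column is an assumption, since the text does not state which estimator was used.

```python
import random
import statistics


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, held_out_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# With the 305 samples of this study, each fold holds out 61 samples
# and trains on the remaining 244.
splits = list(k_fold_splits(305, k=5))
fold_sizes = [len(held_out) for _, held_out in splits]
```

In use, a model would be fitted on each `train` index set, scored on the matching `held_out` set, and the five scores passed to `summarize`.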

Table 4 summarizes the performance of the single and hybrid models in forecasting NOPT and PCE. The evaluation used a suite of statistical indicators (R², RMSE, PICP, U95, MRAE, and MDAPE) to assess predictive accuracy and reliability. Among the single models, XGB gave the lowest or joint-lowest RMSE, MRAE, and MDAPE on the training and test sets for both targets, which highlights its ability to approximate the underlying solar energy dynamics. Each hybrid configuration in turn improved on its standalone counterpart in test-set R² and RMSE. In particular, the XGBA model achieved test-set R² values of 0.9954 for NOPT and 0.9970 for PCE, capturing nearly all of the variability observed in the actual system behavior. Its uncertainty values (U95 = 0.5346 for NOPT and 0.1526 for PCE) were also the lowest of all models, underscoring the robustness and stability of its forecasts. These outcomes demonstrate that the XGBA model not only minimizes deviation from ground truth but also provides reliable and consistent estimates, which are indispensable for effective scheduling, energy resource allocation, and risk reduction in solar energy management. Collectively, the results support the use of hybrid learning strategies, particularly XGBA, for accurate and resilient decision-making in renewable energy systems.

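Table 4. Performance metrics of the models, assessing their predictive accuracy and effectiveness using key statistical indicators.

| Process | Target | Framework | Model | R² | RMSE | PICP | U95 | MRAE | MDAPE |
|---|---|---|---|---|---|---|---|---|---|
| Train | NOPT | Single | RBF | 0.9521 | 0.623 | 0.905 | 1.725 | 0.185 | 1.250 |
| Train | NOPT | Single | RF | 0.9716 | 0.484 | 0.895 | 1.342 | 0.135 | 1.071 |
| Train | NOPT | Single | XGB | 0.9809 | 0.397 | 0.895 | 1.094 | 0.106 | 1.042 |
| Train | NOPT | Hybrid | RBBA | 0.9838 | 0.387 | 0.895 | 1.070 | 0.140 | 1.071 |
| Train | NOPT | Hybrid | RFBA | 0.9883 | 0.336 | 0.895 | 0.924 | 0.083 | 1.071 |
| Train | NOPT | Hybrid | XGBA | 0.9923 | 0.250 | 0.895 | 0.691 | 0.077 | 1.000 |
| Train | PCE | Single | RBF | 0.9710 | 0.166 | 0.905 | 0.457 | 0.702 | 6.667 |
| Train | PCE | Single | RF | 0.9773 | 0.150 | 0.905 | 0.415 | 0.666 | 5.556 |
| Train | PCE | Single | XGB | 0.9812 | 0.135 | 0.952 | 0.371 | 0.503 | 5.556 |
| Train | PCE | Hybrid | RBBA | 0.9832 | 0.128 | 0.895 | 0.354 | 0.522 | 5.500 |
| Train | PCE | Hybrid | RFBA | 0.9906 | 0.096 | 0.905 | 0.266 | 0.429 | 4.444 |
| Train | PCE | Hybrid | XGBA | 0.9944 | 0.072 | 0.905 | 0.200 | 0.314 | 3.846 |
| Validation | NOPT | Single | RBF | 0.9803 | 0.598 | 0.864 | 1.630 | 0.125 | 1.055 |
| Validation | NOPT | Single | RF | 0.9742 | 0.500 | 0.864 | 1.385 | 0.090 | 1.055 |
| Validation | NOPT | Single | XGB | 0.9844 | 0.433 | 0.864 | 1.171 | 0.089 | 1.055 |
| Validation | NOPT | Hybrid | RBBA | 0.9970 | 0.189 | 0.864 | 0.521 | 0.030 | 0.799 |
| Validation | NOPT | Hybrid | RFBA | 0.9877 | 0.352 | 0.864 | 0.967 | 0.070 | 0.520 |
| Validation | NOPT | Hybrid | XGBA | 0.9953 | 0.215 | 0.864 | 0.593 | 0.045 | 1.043 |
| Validation | PCE | Single | RBF | 0.9848 | 0.150 | 0.864 | 0.416 | 0.441 | 6.786 |
| Validation | PCE | Single | RF | 0.9812 | 0.164 | 0.864 | 0.454 | 0.557 | 8.846 |
| Validation | PCE | Single | XGB | 0.9866 | 0.150 | 0.864 | 0.406 | 0.507 | 7.768 |
| Validation | PCE | Hybrid | RBBA | 0.9917 | 0.124 | 0.955 | 0.338 | 0.282 | 6.250 |
| Validation | PCE | Hybrid | RFBA | 0.9932 | 0.101 | 0.864 | 0.274 | 0.229 | 5.000 |
| Validation | PCE | Hybrid | XGBA | 0.9963 | 0.071 | 0.955 | 0.197 | 0.120 | 4.473 |
| Test | NOPT | Single | RBF | 0.9729 | 0.469 | 0.864 | 1.298 | 0.109 | 0.935 |
| Test | NOPT | Single | RF | 0.9631 | 0.581 | 0.864 | 1.610 | 0.118 | 1.167 |
| Test | NOPT | Single | XGB | 0.9910 | 0.271 | 0.909 | 0.748 | 0.077 | 0.875 |
| Test | NOPT | Hybrid | RBBA | 0.9835 | 0.390 | 0.909 | 1.076 | 0.098 | 0.935 |
| Test | NOPT | Hybrid | RFBA | 0.9882 | 0.313 | 0.864 | 0.863 | 0.058 | 0.810 |
| Test | NOPT | Hybrid | XGBA | 0.9954 | 0.193 | 0.864 | 0.535 | 0.053 | 0.935 |
| Test | PCE | Single | RBF | 0.9685 | 0.175 | 0.909 | 0.484 | 1.000 | 7.463 |
| Test | PCE | Single | RF | 0.9690 | 0.174 | 0.864 | 0.481 | 0.884 | 7.350 |
| Test | PCE | Single | XGB | 0.9844 | 0.132 | 0.909 | 0.360 | 0.320 | 5.000 |
| Test | PCE | Hybrid | RBBA | 0.9823 | 0.138 | 0.864 | 0.382 | 0.600 | 6.154 |
| Test | PCE | Hybrid | RFBA | 0.9889 | 0.098 | 0.864 | 0.272 | 0.302 | 4.199 |
| Test | PCE | Hybrid | XGBA | 0.9970 | 0.059 | 0.909 | 0.153 | 0.309 | 3.372 |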
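For readers who wish to reproduce the indicators in Table 4, the sketch below gives commonly used definitions of four of them. These definitions are assumptions, because the formulas are not restated in this section; in particular, U95 is taken as 1.96·√(SD² + RMSE²), with SD the population standard deviation of the errors. Under that assumed definition and negligible bias, U95 ≈ 1.96·√2·RMSE, which gives about 0.535 for the test-set NOPT RMSE of 0.193 and agrees with the tabulated value.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def u95(y_true, y_pred):
    """Expanded uncertainty at 95% confidence (assumed definition)."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sd = statistics.pstdev(errors)  # population SD of the errors (assumption)
    return 1.96 * math.sqrt(sd ** 2 + rmse(y_true, y_pred) ** 2)


def mdape(y_true, y_pred):
    """Median absolute percentage error, in percent; requires nonzero targets."""
    return 100.0 * statistics.median([abs((t - p) / t) for t, p in zip(y_true, y_pred)])


measured = [1.0, 2.0, 3.0, 4.0]   # illustrative values, not study data
predicted = [1.1, 1.9, 3.1, 3.9]
scores = {
    "R2": r2(measured, predicted),
    "RMSE": rmse(measured, predicted),
    "U95": u95(measured, predicted),
    "MDAPE": mdape(measured, predicted),
}
```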

Figure 7 compares the models graphically across the evaluation metrics and highlights the gap between single and hybrid models in forecasting NOPT and PCE. Taking R² as an example, the RF model shows the lowest correlation and hence the weakest predictive ability; that is, its predictions deviate most from the peak operating times and PCE measured in real solar energy systems. For RMSE, all models yield lower errors for PCE than for NOPT. This indicates lower absolute error for PCE, which is partly expected because the measured PCE values span a narrower range than NOPT (0.15 to 4.1 versus 0.1 to 10.12 on the test set). The U95 values follow the same pattern, with smaller absolute uncertainty for PCE. In contrast, MRAE and MDAPE are higher for PCE, meaning that relative and percentage errors are larger for the efficiency predictions than for the peak-time predictions. For PICP, RBF, XGB, and XGBA give the highest test-set coverage for PCE, while XGB and RBBA perform best for NOPT, indicating that these models offer the most reliable prediction intervals in real-world solar energy applications.

Fig. 7. Performance evaluation of the developed models using key statistical metrics. The best-performing models were selected based on their superior accuracy and reliability.

Figure 8 compares the scatter plots for NOPT and PCE predictions, which show how closely the models’ predictions match the measured values. The points for the hybrid XGBA model lie very close to the best-fit line and mostly within the ±15% deviation lines, indicating good agreement between predicted and actual values. For NOPT, this means that the model can accurately predict the optimal peak operating times; for PCE, the forecast is close to the actual power conversion efficiency of the solar modules. From a financial perspective, such dependable projections enable solar farm managers and investors to schedule energy production more accurately, making better use of resources and allowing greater confidence in revenue estimation. Correct predictions of peak times and efficiencies underpin operational decisions on maintenance, energy trading, and capacity planning, all of which reduce economic risk and increase overall profitability.

Fig. 8. Scatter plot of predicted versus actual values on the test dataset, showing the performance of the selected models.

Table 5 provides a statistical comparison of the best hybrid models for both the NOPT and PCE targets in the testing phase. The measured NOPT values range from 0.1 (min) to 10.12 (max), with a mean, median, and standard deviation of 4.3182, 3.8950, and 2.8322, respectively. The XGBA model most closely reproduces the measured mean (4.3079) and standard deviation (2.8234), indicating that it captures both the typical values and the spread of the number of optimal peak operating times. The measured PCE values range from 0.15 to 4.1, with a mean, median, and standard deviation of 1.7836, 1.7250, and 0.9239, respectively. All models predict minimum values that agree with the measured minimum. The XGBA model predicts a maximum of 4.0875, the closest to the measured maximum, indicating good model performance under peak efficiency conditions. These results suggest that the hybrid models are useful for solar energy systems because they reproduce both normal performance and peak outputs; operators can therefore schedule energy use around predicted periods of high generation and efficiency to maximize revenue and reduce financial uncertainty.

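Table 5. Statistical metrics used to compare the top-performing models.

| Phase | Target | Model | Max | Min | Mean | Median | St. Dev |
|---|---|---|---|---|---|---|---|
| Test | NOPT | Measured | 10.12 | 0.1 | 4.3182 | 3.8950 | 2.8322 |
| Test | NOPT | RBBA | 9.593 | 0.090 | 4.3712 | 3.8955 | 2.9352 |
| Test | NOPT | RFBA | 10.035 | 0.106 | 4.2759 | 3.8938 | 2.7816 |
| Test | NOPT | XGBA | 10.284 | 0.104 | 4.3079 | 3.9006 | 2.8234 |
| Test | PCE | Measured | 4.1 | 0.15 | 1.7836 | 1.7250 | 0.9239 |
| Test | PCE | RBBA | 4.2338 | 0.1500 | 1.7919 | 1.6694 | 0.9696 |
| Test | PCE | RFBA | 4.0478 | 0.15 | 1.7866 | 1.7620 | 0.9254 |
| Test | PCE | XGBA | 4.0875 | 0.1500 | 1.8109 | 1.8228 | 0.9340 |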

Figures 9 and 10 show the prediction errors of each model for the NOPT and PCE targets. The XGBA model has the narrowest error distribution around zero, indicating that almost all of its predictions are very close to the actual values. More specifically, in Fig. 9 the errors of the XGBA model are concentrated near zero, within the range of −5 to +5, while the other models show errors over much wider ranges. This accuracy allows the number of peak operating times and the power conversion efficiency to be predicted reliably. From an investor’s point of view, such accurate predictions are valuable: they enable solar power investors and managers to estimate likely energy output and efficiency with a high degree of confidence, and so to allocate resources better, plan maintenance activities more effectively, and predict revenues more accurately. As a result, models such as XGBA can reduce financial risk, increase profit potential, and enhance decision-making in solar energy projects.

Fig. 9. Histogram showing the distribution of prediction errors for the selected models.

Fig. 10. Line plot of prediction errors for the selected models.

Table 6 presents the results of Dunn’s post hoc test for pairwise model comparisons alongside the Durbin–Watson (DW) statistics to assess the reliability and independence of model residuals. Dunn’s post hoc test is a non-parametric method used for multiple pairwise comparisons following a Kruskal–Wallis test and was selected because the performance metrics, such as RMSE and R², do not necessarily follow a normal distribution. This test evaluates whether the differences in model performance are statistically significant. The Durbin–Watson statistic measures first-order autocorrelation in the residuals, with values ranging from 0 to 4; values close to 2 indicate no significant autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 indicate negative autocorrelation. In this study, the XGBA model shows a DW value of 1.9274 for both NOPT and PCE, which is very close to 2 and indicates no significant autocorrelation in its residuals. The absence of serial correlation implies that the model’s errors are not driven by a systematic pattern in the data. In contrast, for PCE all other models have DW values between 1.12 and 1.17, indicating strong positive autocorrelation, whereas for NOPT every model lies between 1.83 and 2.30. Together, Dunn’s post hoc test and the DW statistics provide a rigorous assessment of model validity: the former tests whether XGBA’s performance differences are statistically significant, while the latter shows that its residuals are free of serial correlation, supporting the model’s robustness and generalization capability.

Table 6. Results of Dunn’s post hoc test for pairwise model comparisons.

| Target | Model | DW statistic | Conclusion |
|---|---|---|---|
| PCE | XGBA | 1.9274 | No significant autocorrelation (ideal) |
| PCE | XGB | 1.1333 | Strong positive autocorrelation (residuals are correlated) |
| PCE | RFBA | 1.1548 | Strong positive autocorrelation (residuals are correlated) |
| PCE | RF | 1.1676 | Strong positive autocorrelation (residuals are correlated) |
| PCE | RBBA | 1.1554 | Strong positive autocorrelation (residuals are correlated) |
| PCE | RBF | 1.1222 | Strong positive autocorrelation (residuals are correlated) |
| NOPT | XGBA | 1.9274 | No significant autocorrelation (ideal) |
| NOPT | XGB | 2.0649 | No significant autocorrelation (ideal) |
| NOPT | RFBA | 2.2966 | No significant autocorrelation (ideal) |
| NOPT | RF | 2.0243 | No significant autocorrelation (ideal) |
| NOPT | RBBA | 1.8326 | No significant autocorrelation (ideal) |
| NOPT | RBF | 1.8417 | No significant autocorrelation (ideal) |
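The Durbin–Watson statistic reported in Table 6 follows the standard formula DW = Σ(eₜ − eₜ₋₁)² / Σeₜ², which the short sketch below implements. The function name and the example residuals are illustrative only; note that the result depends on the order in which the residuals are listed.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for an ordered sequence of residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Illustrative residual sequences (not study data):
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # sign flips every step
clustered = durbin_watson([1.0, 1.0, -1.0, -1.0])    # neighbors tend to agree
constant = durbin_watson([2.0, 2.0, 2.0])            # identical residuals
```

Alternating residuals push the statistic above 2 (negative autocorrelation), while clustered or constant residuals pull it toward 0 (positive autocorrelation).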

Table 7 presents the confidence intervals (CIs) for RMSE and MDAPE across the training, validation, and testing phases for all single and hybrid models, for both target variables, NOPT and PCE. These intervals provide an explicit measure of prediction uncertainty and offer insight into the statistical stability and reliability of each modeling framework beyond pointwise performance metrics. During the training phase, all models exhibit relatively narrow confidence intervals, indicating stable learning and limited dispersion in prediction errors. For NOPT prediction, the hybrid models—particularly XGBA—show comparatively tighter RMSE and MDAPE intervals, suggesting more consistent error distributions than single-model counterparts. A similar trend is observed for PCE, where hybrid models demonstrate reduced uncertainty bounds, reflecting improved robustness during model fitting. During the validation phase, the confidence intervals slightly widen across all models, as expected, since predictions are evaluated on unseen data used for hyperparameter tuning.

Nevertheless, hybrid models maintain narrower CI ranges than single models for both RMSE and MDAPE. This behavior indicates enhanced generalization capability and reduced sensitivity to data variability, reinforcing the effectiveness of hybridization strategies in controlling prediction uncertainty. In the testing phase, confidence intervals widen further, reflecting realistic uncertainty under fully unseen data conditions. Despite this, the XGBA model consistently exhibits balanced, relatively compact CI ranges for both NOPT and PCE, demonstrating reliable performance and controlled error dispersion. The comparable CI widths across training, validation, and testing subsets indicate the absence of severe overfitting and confirm the statistical stability of the proposed hybrid framework.

Figure 11 shows the Taylor diagram for the difference between measured and predicted values. The RBF-based models outperform the other approaches for both nopt and PCE, achieving the highest correlation coefficients and standard deviations closest to the measured data, which results in the lowest overall error. Tree-based and ensemble models (RF, XGB, and their variants) capture general trends but show noticeable variance mismatch and reduced correlation, especially for PCE. Overall, the Taylor diagrams confirm the RBF model’s superior robustness and generalization, particularly in representing the system’s nonlinear behavior.

Fig. 11. Taylor diagram for the difference between measured and predicted values.

Figures 12 and 13 present a combined sensitivity analysis assessing the impact of the input variables on the output variables NOPT and PCE. As shown in Fig. 12, the FAST sensitivity analysis identifies Emin as the variable with the most significant influence on NOPT, with an S1 value of 1.45, while Emax is the leading variable for PCE predictions, with an S1 of 2.2. This indicates that the extreme values of generated electrical energy strongly govern both the optimal timing of peak operation and the PV system’s efficiency. In physical terms, Emin reflects periods of minimal energy generation, which critically limit the identification of optimal peak times, whereas Emax corresponds to the highest achievable energy output, directly affecting power conversion efficiency. Furthermore, the accumulated local effects (ALE) analysis in Fig. 13 shows the influence of each variable on the model outputs, along with lower and upper confidence intervals for the NOPT and PCE predictions. These analyses highlight that, in addition to Emin and Emax, variables such as Ap also contribute significantly, reflecting the direct impact of solar irradiance on system performance. Physically, higher irradiance levels increase energy production and efficiency, while variations in the minimum and maximum energy values determine the system’s operational window and efficiency ceiling.

The different ranking of feature importance between FAST and ALE arises from their distinct perspectives: FAST captures global variance contributions, while ALE highlights local and conditional effects. For example, Emin shows the greatest influence on NOPT in FAST because variations in minimal energy generation dominate overall prediction variance, whereas ALE indicates that Emax has stronger local effects on PCE, reflecting its direct impact on peak conversion efficiency under high irradiance conditions. These findings not only identify the most sensitive parameters but also provide actionable insights for PV system operators: by understanding which energy extremes and irradiance levels most strongly affect system performance, operators can optimize resource allocation, system design, and maintenance schedules to maximize energy yield. This connection between model sensitivity and real-world PV behavior enhances the interpretability and practical relevance of the predictive framework.
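To make the "local and conditional" character of ALE concrete, the following is a simplified sketch of a first-order ALE curve for a single input. It is not the authors' implementation and does not cover FAST: the function name, the quantile binning, and the unweighted centering step are all simplifying assumptions, and the toy model and data are hypothetical.

```python
def ale_1d(predict, rows, feature, n_bins=5):
    """Simplified first-order ALE curve for one input column.

    predict: callable mapping a list of feature values to a number.
    rows:    list of feature-value lists (the observed samples).
    Returns (bin_edges, ale_values), with one ALE value per edge.
    """
    values = sorted(r[feature] for r in rows)
    last = len(values) - 1
    # Quantile-based edges, so every bin contains observed samples.
    edges = sorted({values[round(q * last / n_bins)] for q in range(n_bins + 1)})
    ale = [0.0]
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = [r for r in rows
                  if lo < r[feature] <= hi or (k == 0 and r[feature] == lo)]
        # Local effect: mean change in the prediction when the feature moves
        # from the lower to the upper bin edge, other inputs kept as observed.
        diffs = []
        for r in in_bin:
            upper, lower = list(r), list(r)
            upper[feature], lower[feature] = hi, lo
            diffs.append(predict(upper) - predict(lower))
        ale.append(ale[-1] + sum(diffs) / len(diffs))
    center = sum(ale) / len(ale)  # unweighted centering (simplification)
    return edges, [a - center for a in ale]


# Hypothetical two-input model and data, for illustration only.
rows = [[i / 10.0, (i * 7 % 10) / 10.0] for i in range(50)]
model = lambda r: 2.0 * r[0] + r[1] ** 2
edges, ale = ale_1d(model, rows, feature=0)
```

Because effects are averaged only over samples that actually fall in each bin, the curve reflects how the model responds within the observed data, which is why its ranking of inputs can differ from a global variance decomposition.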

Fig. 12. FAST sensitivity analysis depicting the effect of input variables on the model output.

Fig. 13. Sensitivity analysis for the impact of input variables on the model’s output based on the ALE method.

Discussion

Limitations and future work

Despite the high predictive accuracy and robustness of the proposed hybrid model, several limitations exist. First, the study primarily relies on historical PV systems and meteorological data, which may limit model performance under entirely new climatic scenarios or rapidly changing environmental conditions. Second, while the hybrid framework demonstrates strong accuracy and interpretability, exploring more advanced deep learning architectures, such as Transformers or Graph Neural Networks, was beyond the current scope. Third, uncertainty quantification was performed using standard evaluation metrics, but probabilistic forecasting and real-time adaptive prediction were not fully addressed.

Future work includes:

Integration of advanced reinforcement and deep learning-based hybrid models (e.g., Transformers, GNNs) to capture complex temporal and spatial dependencies in PV systems.
Development of probabilistic and real-time adaptive forecasting approaches to improve reliability under dynamic environmental conditions.
Expansion of the framework to include larger and more diverse PV datasets, enhancing generalization and practical applicability.
Further exploration of explainable AI techniques to deepen physical insight and improve transparency for operational decision-making.

These directions aim to enhance both the predictive performance and practical deployment of hybrid PV forecasting models in real-world energy systems.

In addition, the current dataset and modeling framework do not explicitly account for environmental disturbances, such as dust accumulation, humidity, partial shading, or soiling, which are known to influence photovoltaic system performance in real-world deployments. The absence of such factors may limit the generalizability of the predictions to field conditions where these disturbances occur. Nonetheless, the selected input variables, including Ap, Amin, Amax, Ep, Emin, Emax, and nyield, indirectly reflect cumulative environmental effects on system performance. For example, variability in irradiance and energy output may partially capture the influence of transient shading or atmospheric conditions. To enhance applicability in operational settings, future studies should integrate additional environmental monitoring data, including humidity levels, dust deposition rates, soiling factors, and shading patterns. Incorporating these features into hybrid machine learning models can improve predictive robustness, reduce uncertainty under extreme or variable conditions, and increase the reliability of NOPT and PCE forecasts for real-world PV systems. This limitation does not diminish the current study’s contribution, as the framework provides a robust baseline for forecasting PV system performance under nominal environmental conditions and can readily be extended to include more complex environmental variables in subsequent research.

Practical application scenario in PV systems

Beyond numerical accuracy, the proposed forecasting framework can be directly integrated into the operational workflow of real PV plants as a decision-support tool. In a practical deployment scenario, the trained model can be embedded within a plant energy management system to provide day-ahead or intra-day predictions of NOPT and PCE based on real-time or forecasted environmental inputs. Specifically, NOPT predictions enable operators to identify time windows during which the PV system operates at maximum effectiveness, supporting informed scheduling of load management, grid interaction, and energy storage charging or discharging. Accurate PCE forecasting allows continuous assessment of system health and performance degradation, facilitating early detection of faults, soiling, or suboptimal operating conditions. When predicted PCE deviates from expected values under similar irradiance and energy conditions, maintenance actions can be prioritized proactively.

Furthermore, the sensitivity analysis results provide actionable physical insight for system optimization. The dominance of variables such as Emin and Emax indicates that energy extremes critically influence both operational timing and efficiency, suggesting that operational strategies should focus on mitigating low-energy periods and maximizing utilization during high-energy intervals. This information can guide inverter control strategies, energy storage dispatch, and plant design adjustments, such as panel orientation or capacity planning. From an economic perspective, integrating the proposed model into PV plant operation can reduce uncertainty in energy yield forecasting, improve scheduling efficiency, and support more reliable participation in energy markets. The framework is scalable and adaptable to different plant sizes and climatic regions, making it suitable for both utility-scale PV plants and distributed solar installations. As a result, the proposed approach bridges the gap between high-accuracy data-driven modeling and practical, real-world PV system management.

Comparison with published papers

Table 8 compares the proposed XGBA model with recent hybrid PV forecasting studies. Xu et al.^47^ combined EEMD decomposition, XGBoost, LSTM, and Snake Optimization for PV power prediction, achieving reduced errors but focusing only on power series without addressing optimal operating points or efficiency. Renold et al.^19^ integrated TCN, LSTM, and GRU networks for short-term PV forecasting, improving accuracy and computational efficiency. Wang et al.^48^ applied a stacking strategy of gradient-boosted and deep networks for solar irradiance and generation, achieving R ≈ 0.99. Tanyıldızı and Ağır^49^ combined LSTM with SVM for very short-term PV forecasting, reporting R ≈ 0.9823 and RMSE ≈ 0.0300, demonstrating hybridization benefits over single models. The proposed XGBA model surpasses these approaches, achieving R² up to ~ 0.997 and very low RMSE for both NOPT and PCE. Unlike prior studies, it forecasts both optimal peak times and power conversion efficiency, offering broader applicability, robust generalization, and low prediction uncertainty.

Table 8. Comparison between the presented and published studies.

| Study | Model / hybrid approach | Target / forecast task | Reported metrics (R² / RMSE) | Notes (comparison) |
|---|---|---|---|---|
| Xu et al.^47^ | Hybrid of EEMD decomposition + XGBoost + LSTM + Snake Optimization | PV power forecasting | Significant error reduction; hybrid outperforms individual models | Hybrid fusion of low- and high-frequency components; better than PSO and SSA variants (PubMed) |
| Renold et al.^19^ | Hybrid ML combining advanced ML predictors (TCN/LSTM/GRU) | Short-term PV power | Improves accuracy and runtime efficiency | Demonstrates improved short-term forecasts with hybrid deep ML (sciencedirect.com) |
| Wang et al.^48^ | Stacking of gradient-boosted and deep networks | Short-term solar irradiance / generation | R ≈ 0.99 for hybrid BiLSTM configurations | Hybrid stacking enhances nonlinear and temporal pattern learning (MDPI) |
| Tanyıldızı and Ağır^49^ | Combines LSTM with SVM | Very-short-term PV forecasting | R ≈ 0.9823 and RMSE ≈ 0.0300 | Hybrid outperforms individual LSTM and SVM models (DergiPark) |
| Proposed XGBA (This Study) | Hybrid gradient boosting + adaptive aggregation | NOPT & PCE forecasting | R² up to ~ 0.997; very low RMSE | Highest accuracy across multiple targets; robust generalization and low uncertainty |

Conclusion

This study introduced a framework to accurately predict solar energy parameters, namely the number of optimal peak operating times (NOPT) and the power conversion efficiency (PCE), using hybrid machine learning models optimized through the Bat Algorithm (BAT). Hyperparameter tuning improved the performance of each base model, Radial Basis Function (RBF), eXtreme Gradient Boosting Regression (XGBR), and Random Forest Regression (RFR), by exploring the parameter space and reaching optimal predictive results with fewer iterations of the algorithm. The dataset consisted of 305 records with seven features: solar irradiance (Ap, Amin, Amax), electrical energy output (Ep, Emin, Emax), and normalized energy yield (nyield), which collectively represented the environmental and operational conditions that influence solar energy systems. Numerical evaluation of the hybrid models highlighted the superiority of the XGBA model. Specifically, the RMSE of the single XGB model in predicting NOPT was 40.155% higher than that of XGBA, and its U95 value for PCE was 135.58% higher, demonstrating that the hybrid model was more accurate, stable, and robust across both targets. These enhancements suggested that the hybrid system was capable of predicting both average and extreme weather conditions, supporting effective management and scheduling of solar energy. In addition, three sensitivity analysis procedures were used to determine the effects of input variables on the models’ outputs. The FAST sensitivity analysis identified Emin as the most crucial variable for NOPT, whereas for PCE, the highest first-order effect (S1) corresponded to Emax and the highest total effect (ST) to . These outcomes provided valuable insights regarding the drivers predominantly affecting solar energy performance and enabled informed decision-making. 
In general terms, the union of hybrid modeling, BAT optimization, and rigorous sensitivity analysis provided a stable, understandable, and highly performing system for predicting solar energy parameters and supporting strategic planning for solar energy projects.
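The percentage figures quoted in the Conclusion can be checked against the rounded test-set entries of Table 4. The sketch below is a worked check, not a computation taken from the paper; it suggests that the quoted 40.155% and 135.58% are expressed relative to the XGBA value, since a reduction relative to the single model cannot exceed 100%. Small differences from the quoted figures presumably reflect rounding in the table.

```python
# Rounded test-set values from Table 4.
xgb_rmse_nopt, xgba_rmse_nopt = 0.271, 0.193
xgb_u95_pce, xgba_u95_pce = 0.360, 0.153


def percent_higher(single, hybrid):
    """How much larger the single-model value is, relative to the hybrid value."""
    return 100.0 * (single - hybrid) / hybrid


def percent_reduction(single, hybrid):
    """How much the hybrid lowers the value, relative to the single model."""
    return 100.0 * (single - hybrid) / single


rmse_higher = percent_higher(xgb_rmse_nopt, xgba_rmse_nopt)        # about 40.4
rmse_reduction = percent_reduction(xgb_rmse_nopt, xgba_rmse_nopt)  # about 28.8
u95_higher = percent_higher(xgb_u95_pce, xgba_u95_pce)             # about 135.3
u95_reduction = percent_reduction(xgb_u95_pce, xgba_u95_pce)       # about 57.5
```

Expressed as reductions relative to the single XGB model, the same table entries correspond to roughly 29% lower RMSE for NOPT and 57% lower U95 for PCE.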

Bibliography

1. Zhang, W. Main contributions, applications and future prospect of PV. In MATEC Web of Conferences, vol. 386, p. 3012 (2023).
2. El-Din, H. A., Elkelawy, M. & Yu-Sheng, Z. HCCI engines combustion of CNG fuel with DME and H2 additives. SAE Tech. Paper (2010).
3. Aboubakr, M. H., Elkelawy, M., Bastawissi, H. A. E. & El-Tohamy, A. R. A technical survey on using oxyhydrogen with biodiesel/diesel blend for homogeneous charge compression ignition engine. J. Eng. Res. 8, 1 (2024).
4. Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30 (2017).
5. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016).
6. Heshmati, R. A. A., Alavi, A. H., Keramati, M. & Gandomi, A. H. A radial basis function neural network approach for compressive strength prediction of stabilized soil. In Road Pavement Material Characterization and Rehabilitation: Selected Papers from the 2009 GeoHunan International Conference, pp. 147–153. 10.1061/41043(350)20 (2009).
7. Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
8. Breiman, L., Friedman, J., Olshen, R. & Stone, C. Classification and Regression Trees. CRC Press, Boca Raton, Florida (1984).