Real-time Electrical Power Prediction in a Combined Cycle Power Plant

Jesus L. Lobo; Igor Ballesteros; Izaskun Oregi; Javier Del Ser

arXiv:1907.11653·eess.SP·August 6, 2019

TL;DR

This paper introduces an incremental learning approach for real-time electrical power prediction in combined cycle power plants, leveraging streaming data to improve efficiency and performance over traditional batch methods.

Contribution

It presents a novel incremental learning framework for continuous power prediction, addressing the limitations of batch models in dynamic, real-time environments.

Findings

1. Streaming regressors outperform batch models in processing time.

2. Incremental models adapt better to changing environmental conditions.

3. Proposed approach reduces computational costs significantly.

Tables

Table 1: Input and target variables of the dataset.

VARIABLES	ABBREVIATIONS	DESCRIPTIONS	RANGES	TYPES
Ambient Temperature	AT	Measured in whole degrees in Celsius	$1.81 - 37.11$	Input
Atmospheric Pressure	AP	Measured in units of millibars	$992.89 - 1033.30$	Input
Relative Humidity	RH	Measured as a percentage	$25.56 - 100.16$	Input
Vacuum (Exhaust Steam Pressure)	V	Measured in cm Hg	$25.36 - 81.56$	Input
Full Load Electrical Power Output	PE	Measured in megawatts	$420.26 - 495.76$	Target

Table 2: The experimental benchmark for the comparison of the SRs.

		PREPARATORY SIZES (% OF THE DATASET)
		$5 %$	$20 %$
FEATURE SELECTION	True	Exp1	Exp3
FEATURE SELECTION	False	Exp2	Exp4

Table 3: Results of experiment 1: feature selection with 5% of preparatory instances. Note that RMSE = MAE because all differences are equal.

SR	MSE	RMSE	MAE	$R^{2}$	TIME (s)
PAR	0.007 $\pm$ 0.011	0.062 $\pm$ 0.049	0.062 $\pm$ 0.049	0.872 $\pm$ 0.070	2.97 $\pm$ 0.71
SGDR	0.008 $\pm$ 0.013	0.070 $\pm$ 0.056	0.070 $\pm$ 0.056	0.829 $\pm$ 0.123	2.31 $\pm$ 0.38
MLPR	0.005 $\pm$ 0.007	0.055 $\pm$ 0.041	0.055 $\pm$ 0.041	0.901 $\pm$ 0.011	9.23 $\pm$ 7.78
RHT	0.004 $\pm$ 0.006	0.052 $\pm$ 0.039	0.052 $\pm$ 0.039	0.900 $\pm$ 0.024	2.55 $\pm$ 0.55
RHAT	0.005 $\pm$ 0.007	0.054 $\pm$ 0.040	0.054 $\pm$ 0.040	0.893 $\pm$ 0.024	3.87 $\pm$ 0.75
MFR	0.021 $\pm$ 0.029	0.109 $\pm$ 0.091	0.109 $\pm$ 0.091	0.592 $\pm$ 0.203	107.06 $\pm$ 49.95
MTR	0.019 $\pm$ 0.027	0.104 $\pm$ 0.086	0.104 $\pm$ 0.086	0.629 $\pm$ 0.186	1.344 $\pm$ 0.18

Table 4: Results of experiment 2: no feature selection with 5% of preparatory instances. Note that RMSE = MAE because all differences are equal.

SR	MSE	RMSE	MAE	$R^{2}$	TIME (s)
PAR	0.006 $\pm$ 0.009	0.057 $\pm$ 0.044	0.057 $\pm$ 0.044	0.885 $\pm$ 0.013	3.29 $\pm$ 1.50
SGDR	0.008 $\pm$ 0.012	0.069 $\pm$ 0.053	0.069 $\pm$ 0.053	0.821 $\pm$ 0.119	2.48 $\pm$ 0.90
MLPR	0.005 $\pm$ 0.007	0.055 $\pm$ 0.041	0.055 $\pm$ 0.041	0.897 $\pm$ 0.020	14.76 $\pm$ 12.51
RHT	0.005 $\pm$ 0.007	0.052 $\pm$ 0.040	0.052 $\pm$ 0.040	0.876 $\pm$ 0.047	3.55 $\pm$ 0.70
RHAT	0.005 $\pm$ 0.007	0.054 $\pm$ 0.040	0.054 $\pm$ 0.040	0.884 $\pm$ 0.038	5.42 $\pm$ 0.94
MFR	0.004 $\pm$ 0.007	0.042 $\pm$ 0.039	0.042 $\pm$ 0.039	0.922 $\pm$ 0.041	125.18 $\pm$ 60.71
MTR	0.012 $\pm$ 0.021	0.076 $\pm$ 0.073	0.076 $\pm$ 0.073	0.754 $\pm$ 0.171	1.49 $\pm$ 0.23

Table 5: Results of experiment 3: feature selection with 20% of preparatory instances. Note that RMSE = MAE because all differences are equal.

SR	MSE	RMSE	MAE	$R^{2}$	TIME (s)
PAR	0.005 $\pm$ 0.007	0.055 $\pm$ 0.041	0.055 $\pm$ 0.041	0.904 $\pm$ 0.013	2.40 $\pm$ 0.90
SGDR	0.005 $\pm$ 0.007	0.056 $\pm$ 0.041	0.056 $\pm$ 0.041	0.901 $\pm$ 0.021	1.86 $\pm$ 0.58
MLPR	0.004 $\pm$ 0.007	0.052 $\pm$ 0.039	0.052 $\pm$ 0.039	0.912 $\pm$ 0.016	9.12 $\pm$ 8.83
RHT	0.004 $\pm$ 0.006	0.050 $\pm$ 0.037	0.050 $\pm$ 0.037	0.914 $\pm$ 0.010	2.07 $\pm$ 0.28
RHAT	0.004 $\pm$ 0.007	0.052 $\pm$ 0.039	0.052 $\pm$ 0.039	0.909 $\pm$ 0.010	3.48 $\pm$ 0.57
MFR	0.022 $\pm$ 0.030	0.113 $\pm$ 0.092	0.113 $\pm$ 0.092	0.570 $\pm$ 0.205	94.98 $\pm$ 42.94
MTR	0.024 $\pm$ 0.031	0.120 $\pm$ 0.093	0.120 $\pm$ 0.093	0.539 $\pm$ 0.178	1.11 $\pm$ 0.16

Table 6: Results of experiment 4: no feature selection with 20% of preparatory instances. Note that RMSE = MAE because all differences are equal.

SR	MSE	RMSE	MAE	$R^{2}$	TIME (s)
PAR	0.006 $\pm$ 0.010	0.057 $\pm$ 0.044	0.057 $\pm$ 0.044	0.890 $\pm$ 0.010	2.38 $\pm$ 0.79
SGDR	0.005 $\pm$ 0.007	0.055 $\pm$ 0.040	0.055 $\pm$ 0.040	0.901 $\pm$ 0.014	1.79 $\pm$ 0.48
MLPR	0.004 $\pm$ 0.006	0.051 $\pm$ 0.037	0.051 $\pm$ 0.037	0.917 $\pm$ 0.011	7.96 $\pm$ 7.12
RHT	0.004 $\pm$ 0.006	0.048 $\pm$ 0.036	0.048 $\pm$ 0.036	0.917 $\pm$ 0.023	3.12 $\pm$ 0.43
RHAT	0.005 $\pm$ 0.007	0.055 $\pm$ 0.041	0.055 $\pm$ 0.041	0.892 $\pm$ 0.039	5.16 $\pm$ 1.18
MFR	0.003 $\pm$ 0.006	0.036 $\pm$ 0.035	0.036 $\pm$ 0.035	0.940 $\pm$ 0.053	109.65 $\pm$ 42.04
MTR	0.011 $\pm$ 0.019	0.075 $\pm$ 0.070	0.075 $\pm$ 0.070	0.776 $\pm$ 0.126	1.21 $\pm$ 0.21

Table 7: Hyper-parameter tuning results for SRs in experiment 1.

SR	PARAMETERS	VALUES
PAR	C	0.05
SGDR	alpha	0.1/0.01
	loss	epsilon_insensitive
	penalty	L1/L2
	learning_rate	constant/optimal
MLPR	hidden_layer_sizes	(50,50)
	activation	relu
	solver	adam/sgd
	learning_rate	constant, invscaling, adaptive
	learning_rate_init	0.005/0.001
	alpha	0.000001-0.000000001
RHT	grace_period	200
	split_confidence	0.0000001
	tie_threshold	0.05
	leaf_prediction	perceptron
RHAT	grace_period	200
	split_confidence	0.0000001
	tie_threshold	0.05
	leaf_prediction	perceptron
	delta (ADWIN detector)	0.002
MTR	max_depth	10-100
MTR	min_samples_split	10
MFR	max_depth	10-80
	min_samples_split	10
	n_estimators	50/100

Table 8: Hyper-parameter tuning results for SRs in experiment 2.

SR	PARAMETERS	VALUES
PAR	C	0.5/1.0
SGDR	alpha	0.00001-0.1
	loss	epsilon_insensitive
	penalty	L1/L2
	learning_rate	constant/optimal/invscaling
MLPR	hidden_layer_sizes	(50,50)/(100,100)
	activation	relu/tanh/identity
	solver	adam/sgd
	learning_rate	constant, invscaling, adaptive
	learning_rate_init	0.0005-0.05
	alpha	0.00001-0.000000001
RHT	grace_period	200
	split_confidence	0.0000001
	tie_threshold	0.05
	leaf_prediction	perceptron
RHAT	grace_period	200
	split_confidence	0.0000001
	tie_threshold	0.05
	leaf_prediction	perceptron
	delta (ADWIN detector)	0.002
MTR	max_depth	20-90
MTR	min_samples_split	2/5/10
MFR	max_depth	20-90
	min_samples_split	2/5
	n_estimators	50/100

Table 9: Hyper-parameter tuning results for SRs in experiment 3.

SR	PARAMETERS	VALUES
PAR	C	0.01
SGDR	alpha	$0.001$
	loss	epsilon_insensitive
	penalty	elasticnet/L1
	learning_rate	constant
MLPR	hidden_layer_sizes	$(50) / (100)$
	activation	relu
	solver	adam/sgd
	learning_rate	constant, invscaling, adaptive
	learning_rate_init	$0.005$
	alpha	$0.00001, 0.000001$
RHT	grace_period	$200$
	split_confidence	$0.0000001$
	tie_threshold	$0.05$
	leaf_prediction	perceptron
RHAT	grace_period	$200$
	split_confidence	$0.0000001$
	tie_threshold	$0.05$
	leaf_prediction	perceptron
	delta (ADWIN detector)	$0.002$
MTR	max_depth	$20 - 60$
MTR	min_samples_split	$10$
MFR	max_depth	$20 - 60$
	min_samples_split	$10$
	n_estimators	$50 / 100$

Table 10: Hyper-parameter tuning results for SRs in experiment 4.

SR	PARAMETERS	VALUES
PAR	C	$0.5 / 1.0$
SGDR	alpha	$0.001 - 0.01$
	loss	epsilon_insensitive
	penalty	L1/L2
	learning_rate	constant/optimal
MLPR	hidden_layer_sizes	$(100) / (500) / (50, 50) / (100, 100)$
	activation	relu/tanh
	solver	adam/sgd
	learning_rate	constant, invscaling, adaptive
	learning_rate_init	$0.0005 - 0.05$
	alpha	$0.001 - 0.000000001$
RHT	grace_period	$200$
	split_confidence	$0.0000001$
	tie_threshold	$0.05$
	leaf_prediction	perceptron
RHAT	grace_period	$200$
	split_confidence	$0.0000001$
	tie_threshold	$0.05$
	leaf_prediction	perceptron
	delta (ADWIN detector)	0.002
MTR	max_depth	$40 - 100$
MTR	min_samples_split	$5$
MFR	max_depth	$30 - 100$
	min_samples_split	$5$
	n_estimators	$50 / 100$

Table 11: Feature selection results in each experiment. Selected features are marked with y (yes), the rest with n (no).

		FEATURES
		AT	AP	RH	V
EXPERIMENTS	1	y	n	n	y
EXPERIMENTS	2	y	n	n	y

Table 12: Results of Tukey's range test for experiment 1.

GROUP1	GROUP2	MEAN DIFF.	LOWER	UPPER	REJECT
MFR	MLPR	0.259	0.1317	0.3864	True
MFR	MTR	-0.0108	-0.1382	0.1165	False
MFR	PAR	0.2375	0.1102	0.3649	True
MFR	RHAT	0.2461	0.1188	0.3735	True
MFR	RHT	0.2546	0.1272	0.3819	True
MFR	SGDR	0.2037	0.0764	0.3311	True
MLPR	MTR	-0.2699	-0.3972	-0.1425	True
MLPR	PAR	-0.0215	-0.1489	0.1058	False
MLPR	RHAT	-0.0129	-0.1403	0.1144	False
MLPR	RHT	-0.0045	-0.1318	0.1229	False
MLPR	SGDR	-0.0553	-0.1827	0.072	False
MTR	PAR	0.2484	0.121	0.3757	True
MTR	RHAT	0.257	0.1296	0.3843	True
MTR	RHT	0.2654	0.138	0.3928	True
MTR	SGDR	0.2146	0.0872	0.3419	True
PAR	RHAT	0.0086	-0.1188	0.1359	False
PAR	RHT	0.017	-0.1103	0.1444	False
PAR	SGDR	-0.0338	-0.1612	0.0936	False
RHAT	RHT	0.0084	-0.1189	0.1358	False
RHAT	SGDR	-0.0424	-0.1697	0.085	False
RHT	SGDR	-0.0508	-0.1782	0.0765	False

Table 13: Results of Tukey's range test for experiment 2.

GROUP1	GROUP2	MEAN DIFF.	LOWER	UPPER	REJECT
MFR	MLPR	-0.0223	-0.1304	0.0859	False
MFR	MTR	-0.1953	-0.3035	-0.0871	True
MFR	PAR	-0.0381	-0.1463	0.07	False
MFR	RHAT	-0.0428	-0.151	0.0653	False
MFR	RHT	-0.0259	-0.1341	0.0823	False
MFR	SGDR	-0.1408	-0.249	-0.0326	True
MLPR	MTR	-0.173	-0.2812	-0.0649	True
MLPR	PAR	-0.0159	-0.1241	0.0923	False
MLPR	RHAT	-0.0206	-0.1287	0.0876	False
MLPR	RHT	-0.0036	-0.1118	0.1045	False
MLPR	SGDR	-0.1185	-0.2267	-0.0103	True
MTR	PAR	0.1572	0.049	0.2653	True
MTR	RHAT	0.1525	0.0443	0.2606	True
MTR	RHT	0.1694	0.0612	0.2776	True
MTR	SGDR	0.0545	-0.0537	0.1627	False
PAR	RHAT	-0.0047	-0.1129	0.1035	False
PAR	RHT	0.0122	-0.0959	0.1204	False
PAR	SGDR	-0.1026	-0.2108	0.0055	False
RHAT	RHT	0.0169	-0.0913	0.1251	False
RHAT	SGDR	-0.0979	-0.2061	0.0102	False
RHT	SGDR	-0.1149	-0.223	-0.0067	True

Table 14: Results of Tukey's range test for experiment 3.

GROUP1	GROUP2	MEAN DIFF.	LOWER	UPPER	REJECT
MFR	MLPR	0.2913	0.1947	0.3879	True
MFR	MTR	-0.0508	-0.1474	0.0458	False
MFR	PAR	0.2892	0.1926	0.3859	True
MFR	RHAT	0.2955	0.1988	0.3921	True
MFR	RHT	0.293	0.1963	0.3896	True
MFR	SGDR	0.2874	0.1908	0.3841	True
MLPR	MTR	-0.3421	-0.4387	-0.2455	True
MLPR	PAR	-0.0021	-0.0987	0.0945	False
MLPR	RHAT	0.0042	-0.0925	0.1008	False
MLPR	RHT	0.0016	-0.095	0.0983	False
MLPR	SGDR	-0.0039	-0.1005	0.0927	False
MTR	PAR	0.34	0.2434	0.4367	True
MTR	RHAT	0.3463	0.2496	0.4429	True
MTR	RHT	0.3438	0.2471	0.4404	True
MTR	SGDR	0.3382	0.2416	0.4348	True
PAR	RHAT	0.0062	-0.0904	0.1029	False
PAR	RHT	0.0037	-0.0929	0.1004	False
PAR	SGDR	-0.0018	-0.0984	0.0948	False
RHAT	RHT	-0.0025	-0.0991	0.0941	False
RHAT	SGDR	-0.0081	-0.1047	0.0886	False
RHT	SGDR	-0.0055	-0.1022	0.0911	False

Table 15: Results of Tukey's range test for experiment 4.

GROUP1	GROUP2	MEAN DIFF.	LOWER	UPPER	REJECT
MFR	MLPR	-0.0352	-0.1058	0.0355	False
MFR	MTR	-0.2079	-0.2786	-0.1372	True
MFR	PAR	-0.0549	-0.1255	0.0158	False
MFR	RHAT	-0.0494	-0.1201	0.0212	False
MFR	RHT	-0.0307	-0.1013	0.04	False
MFR	SGDR	-0.0604	-0.131	0.0103	False
MLPR	MTR	-0.1727	-0.2434	-0.1021	True
MLPR	PAR	-0.0197	-0.0904	0.051	False
MLPR	RHAT	-0.0143	-0.0849	0.0564	False
MLPR	RHT	0.0045	-0.0662	0.0752	False
MLPR	SGDR	-0.0252	-0.0959	0.0455	False
MTR	PAR	0.153	0.0824	0.2237	True
MTR	RHAT	0.1585	0.0878	0.2291	True
MTR	RHT	0.1772	0.1066	0.2479	True
MTR	SGDR	0.1475	0.0769	0.2182	True
PAR	RHAT	0.0054	-0.0652	0.0761	False
PAR	RHT	0.0242	-0.0465	0.0949	False
PAR	SGDR	-0.0055	-0.0762	0.0652	False
RHAT	RHT	0.0188	-0.0519	0.0894	False
RHAT	SGDR	-0.0109	-0.0816	0.0597	False
RHT	SGDR	-0.0297	-0.1004	0.0409	False

Equations

$$MAE = \frac{1}{n} \sum_{j=1}^{n} \left| y_{j} - \hat{y}_{j} \right|$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( y_{j} - \hat{y}_{j} \right)^{2}}$$

$$MSE = \frac{1}{n} \sum_{j=1}^{n} \left( y_{j} - \hat{y}_{j} \right)^{2}$$

$$R^{2} = 1 - \frac{\sum_{j=1}^{n} \left( y_{j} - \hat{y}_{j} \right)^{2}}{\sum_{j=1}^{n} \left( y_{j} - \bar{y} \right)^{2}}$$
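The four error metrics listed in this section can be computed directly from paired true and predicted values. The following is a minimal sketch in plain Python, assuming RMSE is taken as the square root of MSE; the sample values are illustrative and are not taken from the dataset. It also shows that when a metric is evaluated on a single instance, RMSE and MAE coincide.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error (square root of MSE)."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical power outputs in megawatts).
y_true = [430.0, 450.0, 470.0, 490.0]
y_pred = [432.0, 447.0, 471.0, 486.0]

print("MAE :", mae(y_true, y_pred))
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2(y_true, y_pred))

# With one instance, sqrt(e^2) equals |e|, so RMSE and MAE are the same number.
print(rmse([450.0], [447.0]) == mae([450.0], [447.0]))
```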

Full text

Real-time Electrical Power Prediction in a Combined Cycle Power Plant

Jesus L. Lobo

Igor Ballesteros

Izaskun Oregi

Javier Del Ser

TECNALIA, Parque Científico y Tecnológico de Bizkaia, Astondo Bidea, Edificio 700. E-48160 Derio (Bizkaia), Spain.

University of the Basque Country UPV/EHU, 48013 Bilbao, Spain

Basque Center for Applied Mathematics (BCAM), 48009 Bilbao, Spain

Abstract

The prediction of electrical power in combined cycle power plants is a key challenge in the electrical power and energy systems field. This power output can vary depending on environmental variables, such as temperature, pressure, and humidity. Thus, the business problem is how to predict the power output as a function of these environmental conditions in order to maximize the profit. The research community has solved this problem by applying machine learning techniques and has managed to reduce the computational and time costs in comparison with the traditional thermodynamical analysis. Until now, this challenge has been tackled from a batch learning perspective in which data is assumed to be at rest, and where models do not continuously integrate new information into already constructed models. We present an approach closer to the Big Data and Internet of Things paradigms in which data arrives continuously and where models learn incrementally, achieving significant enhancements in terms of data processing (time, memory and computational costs), and obtaining competitive performances. This work compares and examines the hourly electrical power prediction of several streaming regressors, and discusses the best technique in terms of processing time and performance to be applied in this streaming scenario.

Keywords: Electrical power prediction, combined cycle power plant, stream learning, online learning regression

1 Introduction

1.1 The Electrical Power Prediction for Combined Cycle Power Plants

Efficiency in combined cycle power plants (CCPPs) is a key issue, as revealed in a recent report [1], which shows that in the next decade the number of projects involving combined cycle technology will increase by $3.1\%$, an estimate based on the high efficiency of CCPPs. Electrical power prediction in CCPPs encompasses numerous factors that should be considered to achieve an accurate estimation. The operators of a power grid often predict the power demand based on historical data and environmental factors, such as temperature, pressure, and humidity. Then, they compare these predictions with available resources, such as coal, natural gas, nuclear, solar, wind, or hydro power plants. Some power generation technologies (e.g. solar and wind) are highly dependent on environmental conditions, and all generation technologies are subject to planned and unplanned maintenance. Thus, the challenge for a power grid operator is how to handle a shortfall in available resources versus actual demand. The power output of a peaker power plant varies depending on environmental conditions, so the business problem is predicting the power output of a peaker power plant as a function of the environmental conditions, since this would enable the grid operator to make economic trade-offs about the number of peaker plants to turn on (or whether to buy expensive power from another grid).

The CCPP referred to in this work uses two gas turbines (GTs) and one steam turbine (ST) together to produce up to $50\%$ more electricity from the same fuel than a traditional simple-cycle plant. The waste heat from the GTs is routed to the nearby ST, which generates extra power. In this real environment, a thermodynamical analysis involves thousands of nonlinear equations whose solution is nearly infeasible because of its computational, memory, and time costs. This barrier is overcome by using a machine learning based approach, which is a frequent alternative to thermodynamical approaches [2]. Concretely, this work applies stream regression (SR) machine learning algorithms to the prediction analysis of a thermodynamic system, namely the mentioned CCPP. The correct prediction of its electrical power output is very relevant for the efficient and economic operation of the plant, and maximizes the income from the available megawatt hours. The sustainability and reliability of the GTs depend highly on this electrical power output prediction, above all when it is subject to constraints of high profitability and contractual liabilities.

1.2 Stream Learning in the Big Data Era

The Big Data paradigm has gained momentum over the last decade because of its promise to deliver valuable insights to many real-world applications [3]. With the advent of this emerging paradigm comes not only an increase in the volume of available data, but also the notion of its arrival velocity; that is, these real-world applications generate data in real-time at rates faster than those that can be handled by traditional systems. This situation leads us to assume that we have to deal with a potentially infinite and ever-growing dataset that may arrive continuously (stream learning, SL) in batches of instances or instance by instance, in contrast to traditional systems (batch learning) where there is free access to all historical data. These traditional processing systems assume that data is at rest and simultaneously accessible. For instance, database systems can store large collections of data and allow users to run queries or transactions. Models based on batch processing do not continuously integrate new information into already constructed models but instead regularly rebuild new models from scratch. In contrast, the incremental learning carried out by SL suits stream processing because it continuously incorporates information into its models and traditionally aims for minimal processing time and space. Because of its capacity for continuous, large-scale, real-time processing, incremental learning has recently gained more attention in the context of Big Data [4]. SL also presents many new challenges and poses stringent conditions [5]: only a single sample (or a small batch of instances) is provided to the learning algorithm at every time instant, processing time is very limited, memory is finite, and a trained model must be available at every scan of the data stream. In addition, these streams of data may evolve over time and may be occasionally affected by a change in their data distribution (concept drift) [6], forcing the system to learn under non-stationary conditions.

We can find many examples of real-world SL applications [7], such as mobile phones, industrial process controls, intelligent user interfaces, intrusion detection, spam detection, fraud detection, loan recommendation, and monitoring and traffic management, among others [8]. In this context, the Internet of Things (IoT) has become one of the main applications of SL [9], since it continuously produces huge quantities of data in real-time. The IoT is defined as sensors and actuators connected by networks to computing systems [10], which monitor and manage the health and actions of connected objects or machines in real-time. Therefore, stream data analysis is becoming a standard way to extract useful knowledge from what is happening at each moment, allowing people or organizations to react quickly when problems emerge or when new trends appear, and helping them to increase their performance.

1.3 CCPPs and Stream Learning Regression

The task of power output prediction can be seen as a process based on data streams, as we will show in this work. The work in [11] is perfectly adequate under specific conditions that allow batch processing, where the author assumed that all the historical data could be stored, processed, and used to predict the electrical power output with machine learning regression algorithms. We tackle the same problem from a contemporary streaming perspective.

In this work we have considered a CCPP as a practical case of IoT application, where different sensors provide the required data to efficiently predict in real-time the full load electrical power output (see Figure 1). In fact, all data generated by IoT applications can be considered as streaming data since it is obtained in specific intervals of time. Power generation is a complex process, and understanding and predicting power output is an important element in managing a CCPP and its connection to the power grid.

Our view is closer to a reality where fast data can be huge, is in motion, and is closely connected, and where there are limited resources (e.g. time, memory) to process it. While it does not seem appropriate to retrain the learning algorithms every time new instances are available (as occurs in batch processing), a streaming perspective introduces significant enhancements in terms of data processing (less time and lower computational costs) and algorithm training (models are updated every time new instances arrive), and presents a modernized vision of a CCPP, considering it as an IoT application and as a part of the Industry $4.0$ paradigm [12]. To the best of our knowledge, this is the first time that an SL approach is applied to CCPPs for electrical output prediction. This work could be widely replicated for other streaming prediction purposes in CCPPs; moreover, it can serve as a practical example of SL application for modern electrical power industries that need to obtain benefits from the Big Data and IoT paradigms.

Our work uses some of the best-known SR learning algorithms to successfully predict the electrical power output in an online manner by using a combination of input parameters defined for GTs and STs (ambient temperature, vacuum, atmospheric pressure, and relative humidity). This work shows how the application of an SL perspective fits the purposes of a modern industry in which data flows constantly, analyzing the impact of several streaming factors (which should be considered before the streaming process starts) on the output prediction. It also compares the error metrics and processing times of several SRs under different experiments, identifying the most recommendable ones for electrical power output prediction, and carries out a statistical significance study.

This work is organized as follows. Section 2 provides background on the topics of the manuscript. Section 3 presents the materials and methods, whereas Section 4 describes the experimental work. Section 5 provides a discussion of the work, and Section 6 presents the final conclusions.

2 Related Work

The literature has addressed related problems by using machine learning approaches. In [13, 11] the authors successfully applied several regression methods to predict the full load electrical power output of a CCPP. A different approach for the same goal was investigated in [14], where the authors presented a novel approach using a feedforward neural network trained with particle swarm optimization [15] to predict power plant output. In line with this last study, the work in [16] developed a new artificial neural network optimized by particle swarm optimization for dew point pressure prediction. In [17] the authors applied forecasting methodologies, including linear and nonlinear regression, to predict GT behavior over time, which allows planning maintenance actions, saving costs, and avoiding unexpected stops. The work in [18] presents a comparison of two strategies for GT performance prediction, using statistical regression as the technique to analyze dynamic plant signals. The prognostic problem of estimating the remaining useful life of GT engines before their next major overhaul was addressed in [19], where a combination of regression techniques was proposed. In [20] it was shown that regression models were good estimators of the response variables to carry out parametric based thermo-environmental and exergoeconomic analyses of CCPPs. The same authors were involved in [21], which used multiple polynomial regression models to correlate the response variables and predictor variables in a CCPP to carry out a thermo-environmental analysis. More recently, [22] presented a real-time derivative-driven regression method for estimating the performance of GTs under dynamic conditions. A scheme for performance-based prognostics of industrial GTs operating under dynamic conditions is proposed and developed in [23], where a regression method is implemented to locally represent the diagnostic information for subsequently forecasting the performance behavior of the engine.

Regarding SL, many studies have focused on it due to its relevance, such as [24, 25, 26, 27, 28], and more recently [29, 30, 31, 32]. The application of regression techniques to SL has been recently addressed in [33], where the authors cover the most important online regression methods. The work in [34] deals with ensemble learning from data streams, focusing concretely on regression ensembles. The authors of [35] propose several criteria for efficient sample selection in SL regression problems within an online active learning context. In general, regression tasks in SL have not received as much attention as classification tasks. This was highlighted in [36], where researchers carried out a study and an empirical evaluation of a set of online algorithms for regression, including the baseline Hoeffding-based regression trees, online option trees, and an online least mean squares filter.

Next we present the materials and methods to carry out the experimental benchmark.

3 Materials and Methods

3.1 System Description

The proposed CCPP is composed of two GTs, one ST and two heat recovery steam generators. In a CCPP, the electricity is generated by GTs and STs, which are combined in one cycle, and is transferred from one turbine to another [37]. The CCPP captures waste heat from the GT to increase efficiency and the electrical output. Basically, a CCPP works as follows (see Figure 1):

Gas turbine burns fuel

The GT compresses air and mixes it with fuel that is heated to a very high temperature. The hot air-fuel mixture moves through the GT blades, making them spin. The fast-spinning turbine drives a generator that converts a portion of the spinning energy into electricity.

Heat recovery system captures exhaust

A Heat Recovery Steam Generator captures exhaust heat from the GT that would otherwise escape through the exhaust stack. The Heat Recovery Steam Generator creates steam from the GT exhaust heat and delivers it to the ST.

Steam turbine delivers additional electricity

The ST sends its energy to the generator drive shaft, where it is converted into additional electricity.

This type of CCPP is being installed in an increasing number of plants around the world where there is access to substantial quantities of natural gas [38]. As reported in [11], the proposed CCPP is designed with a nominal generating capacity of $480$ megawatts, made up of $2 \times 160$ megawatts ABB 13E2 GTs, $2$ dual pressure Heat Recovery Steam Generators and $1 \times 160$ megawatts ABB ST. GT load is sensitive to the ambient conditions, mainly ambient temperature (AT), atmospheric pressure (AP), and relative humidity (RH). ST load, however, is sensitive to the exhaust steam pressure (or vacuum, V). These parameters of both GTs and STs are used as input variables, and the electrical power generated by both GTs and STs is used as the target variable in the dataset of this study. All of them are described in Table 1 and correspond to average hourly data received from the measurement points by the sensors denoted in Figure 1.

3.2 The Stream Learning Process

We define an SL process as one that generates, on a given stream of training data $s_{1},s_{2},s_{3},...,s_{t}$, a sequence of models $h_{1},h_{2},h_{3},...,h_{t}$. In our case $s_{i}$ is labeled training data $s_{i}=(x_{i},y_{i})\in\mathbb{R}^{n}\times\{1,...,C\}$ and $h_{i}\colon\mathbb{R}^{n}\to\{1,...,C\}$ is a model function solely depending on $h_{i-1}$ and the recent $p$ instances $s_{i},...,s_{i-p}$, with $p$ being strictly limited (in this work $p=1$, representing a very stringent online learning use case). The learning process in streaming is incremental [24], which means that we have to face the following challenges:

1. The stream algorithm adapts/learns gradually (i.e. $h_{i+1}$ is constructed based on $h_{i}$ without a complete retraining),

2. Retains the previously acquired knowledge avoiding the effect of catastrophic forgetting [39], and

3. Only a limited number of $p$ training instances are allowed to be maintained. In this work we have applied a real SL approach under stringent conditions in which instance storing is not allowed.
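The three conditions above can be illustrated with a small incremental learner. The following is a minimal sketch, assuming a plain online linear model updated by a stochastic gradient step; the class `OnlineLinearRegressor`, the generator `synthetic_stream`, and the linear target are hypothetical and are not among the SRs compared in this work. Each update derives $h_{i+1}$ from $h_{i}$ and a single instance ($p=1$), and no instance is stored.

```python
import random


class OnlineLinearRegressor:
    """Hypothetical minimal incremental model (not one of the paper's SRs)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_one(self, x):
        return sum(w * v for w, v in zip(self.weights, x)) + self.bias

    def learn_one(self, x, y):
        # h_{i+1} is built from h_i and the single instance (x, y); nothing is stored.
        error = self.predict_one(x) - y
        self.weights = [
            w - self.learning_rate * error * v for w, v in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def synthetic_stream(n_instances, seed=0):
    """Yield (x, y) pairs one at a time; the linear target is an assumption."""
    rng = random.Random(seed)
    for _ in range(n_instances):
        x = [rng.random(), rng.random()]
        yield x, 2.0 * x[0] - 1.0 * x[1] + 0.5


model = OnlineLinearRegressor(n_features=2)
for x, y in synthetic_stream(5000):
    model.learn_one(x, y)  # p = 1: each instance is seen once, then discarded

print(model.weights, model.bias)
```

Because the model is only ever nudged by the current instance, earlier knowledge is kept in the weights rather than in stored data, which is the sense in which the process is incremental.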

Therefore, data-intensive applications often work with transient data: some or all of the input instances are not available from memory. Instances in the stream arrive online (frequently one instance at a time) and can be read at most once, which constitutes the strongest constraint for processing data streams, and the system has to decide whether the current instance should be discarded or archived. Only selected past instances can be accessed by storing them in memory, which is typically small relative to the size of the data streams. When designing SL algorithms, we have to take several algorithmic and statistical considerations into account. For example, as we cannot store all the inputs, we cannot undo a decision made on past data. In batch learning, we have free access to all historical data gathered during the process, and we can then apply "preparatory techniques" such as pre-processing, feature selection or statistical analysis to the dataset, among others (see Figure 2). The problem with stream processing is that there is no access to the whole past dataset, so we have to opt for one of the following strategies. The first one is to carry out the preparatory techniques every time a new batch of instances or a single instance is received, which increases the computational cost and processing time; moreover, the process flow may not be able to stop for this preparatory process because new instances keep arriving. The second one is to store a first group of instances (preparatory instances), carry out the preparatory techniques and data stream analysis on them, and apply the conclusions to the incoming instances. This latter case is very common when streaming is applied to a real environment, and it has been adopted in this work. We will show later how the size of this first group of instances (which might depend on the available memory or the time we can take to collect or process these data) can be crucial to achieve a competitive performance in the rest of the stream.

Once these first instances have been collected, in this work we will apply three common preparatory techniques before the streaming process starts in order to prepare our SRs:

Feature selection

It is one of the core concepts in machine learning and has a large impact on the performance of models: irrelevant or partially relevant features can degrade it. Feature selection can be carried out automatically or manually, and it keeps the features that contribute most to predicting the target variable. Its goals are to reduce overfitting, to improve accuracy, and to reduce training time. In this work we will show how feature selection affects the final results.

Hyper-parameter tuning

A hyper-parameter is a parameter whose value is set before the learning process begins, and this technique tries to choose a set of optimal hyper-parameters for a learning algorithm in order to prevent overfitting and to achieve the maximum performance. There are two main methods for optimizing hyper-parameters: grid search and random search. The first one searches exhaustively through a specified subset of hyper-parameter values, which guarantees finding the best combination among those supplied, but it can be very time consuming and computationally expensive. The second one samples the specified subset randomly instead of exhaustively; its major benefit is a shorter processing time, but it does not guarantee finding the optimal combination of hyper-parameters. In this work we have opted for a random search strategy, considering a real scenario where computational resources and time are limited.

Pre-training

Once we have isolated a set of instances to carry out the previous techniques, why not also use these instances to train our SRs before the streaming process starts? As we will see in Section 4.3, where the test-then-train evaluation is explained, by carrying out a pre-training process our algorithms will obtain better predictions than if they were tested after being trained on a single instance.

3.3 Stream Regression Algorithms

A SL algorithm, like every machine learning method, estimates an unknown dependency between the independent input variables and a dependent target variable from a dataset. In our work, SRs predict the electrical power output of a CCPP from a dataset which consists of couples $(\textbf{x}_{t},y_{t})$ (i.e. instances), and they build a mapping function $\hat{y_{t}}=f(\textbf{x}_{t})$ by using these couples. Their goal is to select the function that minimizes the error between the actual output $(y_{t})$ of a system and the predicted output $(\hat{y_{t}})$ based on instances of the dataset (training instances).

The prediction of a real value (regression) is a very frequent problem in the machine learning field [40], where regression models are used to predict a numeric target feature. Many real-world challenges are solved as regression problems, and evaluated using machine learning approaches to develop predictive models. Concretely, the following algorithms have been specifically designed to run in real time, being capable of learning incrementally every time a new instance arrives. They have been selected due to their wide use in the SL community, and because their implementations can be easily found in three well-known Python frameworks: scikit-multiflow [41], scikit-garden (https://github.com/scikit-garden/scikit-garden) and scikit-learn [42].

Passive-Aggressive Regressor (PAR)

The Passive-Aggressive technique focuses on the target variable of linear regression functions, $\hat{y_{t}}=\textbf{w}_{t}^{T}\cdot\textbf{x}_{t}$ , where $\textbf{w}_{t}$ is the incrementally learned vector. When a prediction is made, the algorithm receives the true target value $y_{t}$ and suffers an instantaneous loss ( $\varepsilon$ -insensitive hinge loss function). This loss function was specifically designed to work with stream data and it is analogous to a standard hinge loss. The role of $\varepsilon$ is to allow a low tolerance of prediction errors. Then, when a round finalizes, the algorithm uses $\textbf{w}_{t}$ and the instance $(\textbf{x}_{t},y_{t})$ to produce a new weight vector $\textbf{w}_{t+1}$ , which will be used to extend the prediction on the next round. In [43] the adaptation to learn regression is explained in detail.
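As an illustration of one round of this procedure, the following minimal sketch implements the basic Passive-Aggressive update for regression described in [43], in plain Python. The function name `pa_regression_update` and the example values are assumptions made for illustration; the scikit-learn implementation used in the experiments additionally exposes an aggressiveness parameter that is omitted here.

```python
def pa_regression_update(w, x, y, epsilon=0.1):
    """One round of the basic Passive-Aggressive regression update.

    w: current weight vector, x: input vector, y: true target,
    epsilon: width of the insensitive zone of the hinge loss.
    Returns the new weight vector, the prediction made and the loss suffered.
    """
    # Prediction with the current weights: y_hat = w . x
    y_hat = sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Epsilon-insensitive hinge loss: zero if the error is within epsilon
    loss = max(0.0, abs(y_hat - y) - epsilon)
    squared_norm = sum(x_i * x_i for x_i in x)
    if loss == 0.0 or squared_norm == 0.0:
        # Passive step: the prediction is good enough, weights are kept
        return list(w), y_hat, loss
    # Aggressive step: smallest change of w that brings the loss to zero
    tau = loss / squared_norm
    direction = 1.0 if y > y_hat else -1.0
    w_next = [w_i + direction * tau * x_i for w_i, x_i in zip(w, x)]
    return w_next, y_hat, loss


# Illustrative round with made-up values
w_next, y_hat, loss = pa_regression_update([0.0, 0.0], [1.0, 2.0], 5.0, epsilon=0.5)
```

After the aggressive step, the updated weights predict the same instance with an error of exactly $\varepsilon$, so the loss on that instance becomes zero.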

Stochastic Gradient Descent Regressor (SGDR)

SGDR is a linear model fitted by minimizing a regularized empirical loss with stochastic gradient descent (SGD) [44], which is one of the most popular optimization algorithms for machine learning methods. There are three variants of gradient descent: batch gradient descent (BGD), SGD, and mini-batch gradient descent (mbGD). They differ in how much data we use to compute the gradient of the objective function; depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. BGD and mbGD perform redundant computations for large datasets, as they recompute gradients for similar instances before each parameter update. SGD does away with this redundancy by performing one update per instance; it is therefore usually much faster and it is often used to learn online [45].
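The per-instance update can be sketched as follows for a linear model with squared loss and L2 regularization. This is a minimal illustration in plain Python, not the scikit-learn implementation; the function name `sgd_step`, the learning rate and the synthetic stream $y=2x+1$ are assumptions chosen for the example.

```python
def sgd_step(w, b, x, y, learning_rate=0.1, alpha=0.0):
    """One SGD update for a linear model with squared loss.

    Objective for a single instance:
        0.5 * (w . x + b - y)^2 + 0.5 * alpha * ||w||^2
    """
    y_hat = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    error = y_hat - y
    # Gradient w.r.t. w is error * x + alpha * w; w.r.t. b it is error
    w_next = [w_i - learning_rate * (error * x_i + alpha * w_i)
              for w_i, x_i in zip(w, x)]
    b_next = b - learning_rate * error
    return w_next, b_next


# Synthetic, noise-free stream following y = 2x + 1 (illustrative only)
w, b = [0.0], 0.0
for _ in range(500):
    for i in range(11):
        x = [i / 10.0]
        w, b = sgd_step(w, b, x, 2.0 * x[0] + 1.0)
```

Each instance is used for exactly one update and can then be discarded, which is what makes the method suitable for streams.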

Multi-layer Perceptron Regressor (MLPR)

Multi-layer Perceptron (MLP) [46] learns a non-linear function approximator for either classification or regression. MLPR uses an MLP trained with backpropagation and with no activation function in the output layer, which can also be seen as using the identity function as the activation function. It uses the squared error as the loss function, and the output is a set of real values.

Regression Hoeffding Tree (RHT)

It is a Hoeffding Tree adapted to regression tasks. A Hoeffding Tree (HT) or Very Fast Decision Tree (VFDT) [47] is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating instances does not change over time, and exploiting the fact that a small sample can often be enough to choose an optimal splitting attribute. The idea is supported mathematically by the Hoeffding bound, which quantifies the number of instances needed to estimate some statistics within a prescribed precision (in this case, the goodness of an attribute). A RHT can be seen as a Hoeffding Tree with two modifications: instead of using information gain to split, it uses variance reduction; and instead of using majority class and naive Bayes at the leaves, it uses the target mean and the perceptron [48].
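To make the role of the Hoeffding bound concrete, the sketch below computes $\epsilon=\sqrt{R^{2}\ln(1/\delta)/(2n)}$, where $R$ is the range of the monitored statistic, $\delta$ the allowed failure probability and $n$ the number of instances seen at a leaf. The split rule shown is the classic VFDT criterion on the difference between the two best candidate merits; regression variants may apply the bound to a different statistic, so the function `should_split` and all numeric values are illustrative assumptions.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this distance of the
    mean observed over n independent instances."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_merit, second_best_merit, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than
    the bound, i.e. when more data is unlikely to change the ranking."""
    return (best_merit - second_best_merit) > hoeffding_bound(value_range, delta, n)


# The same merit gap becomes sufficient once enough instances are seen
decision_small_n = should_split(0.30, 0.25, 1.0, 1e-7, 1000)
decision_large_n = should_split(0.30, 0.25, 1.0, 1e-7, 10000)
```

Because the bound shrinks as $n$ grows, a leaf that cannot justify a split early in the stream may do so later with the same observed gap.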

Regression Hoeffding Adaptive Tree (RHAT)

RHAT is like RHT, but it uses ADWIN [49] to detect drifts and a perceptron to make predictions. As previously mentioned, streams of data may evolve over time and may show a change in their data distribution, which causes learning algorithms to become obsolete. By detecting these drifts we are able to suitably update our algorithms to the new data distribution [25].

Mondrian Tree Regressor (MTR)

The MTR, unlike standard decision tree implementations, does not limit itself to the leaf in making predictions. It takes into account the entire path from the root to the leaf and weighs it according to the distance from the bounding box in that node. This has some interesting properties such as falling back to the prior mean and variance for points far away from the training data. This algorithm has been adapted by the scikit-garden framework to serve as a regressor algorithm.

Mondrian Forest Regressor (MFR)

A MFR [50] is an ensemble of MTRs. As in any ensemble of learners, the variance in predictions is reduced by averaging the predictions from all learners (Mondrian trees). Ensemble-based methods are among the most widely used techniques for data streaming, mainly due to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications [27].

4 Comparative Analysis

4.1 Dataset Description and Exploratory Analysis

The dataset contains $9,568$ data points collected from a CCPP over $6$ years ( $2006-2011$ ), when the power plant was set to work with full load over $674$ different days. The Data and Source Code Availability section at the end of the manuscript contains the details and references of the dataset. The features described in Table 1 consist of hourly average ambient temperature, ambient pressure, relative humidity, and exhaust vacuum, which are used to predict the net hourly electrical energy output of the plant. Although this problem has already been successfully tackled from a batch processing perspective [11], owing to its manageable number of instances and its data arrival rate, it could easily be transposed to a streaming scenario in which the available data would be huge (instances collected over many years) and the data arrival rate would be very demanding (e.g. instances received every second). This more realistic IoT scenario would allow CCPPs to adopt a Big Data approach, predicting the electrical output every time new data is available (e.g. every second) and detecting anomalies early enough to take immediate action.

As we can see in Table 1, the features of our dataset vary widely in magnitude, unit and range. Feature scaling can change the results considerably for certain algorithms and have minimal or no effect on others. It is advisable to scale the features when algorithms compute distances (very often Euclidean distances) or assume normality. In this work we have opted for the min-max scaling method, which brings the values between $0$ and $1$ .
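In a streaming setting the minimum and maximum of each feature are not known in advance, so one option consistent with the preparatory strategy of Section 3.2 is to estimate them from the preparatory instances and reuse them for the rest of the stream. The sketch below assumes this option; the class name `PreparatoryMinMaxScaler` and the sample values are hypothetical, and the actual source code of this work may scale the data differently.

```python
class PreparatoryMinMaxScaler:
    """Min-max scaler whose bounds are fixed from the preparatory instances."""

    def fit(self, rows):
        # rows: list of feature vectors collected before streaming starts
        columns = list(zip(*rows))
        self.mins = [min(column) for column in columns]
        self.maxs = [max(column) for column in columns]
        return self

    def transform_one(self, x):
        # Scale one incoming instance with the stored bounds
        scaled = []
        for value, low, high in zip(x, self.mins, self.maxs):
            span = high - low
            # A constant feature carries no information: map it to 0
            scaled.append((value - low) / span if span > 0 else 0.0)
        return scaled


# Illustrative preparatory instances with two features
preparatory = [[1.81, 40.0], [37.11, 60.0], [19.46, 50.0]]
scaler = PreparatoryMinMaxScaler().fit(preparatory)
scaled = scaler.transform_one([19.46, 70.0])
```

Note that an incoming value outside the range observed during the preparatory phase is mapped outside $[0,1]$ (the second feature above becomes $1.5$), which is one reason why the size of the preparatory set matters.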

The input variables (AT, V, AP, RH) affect the target variable (PE) differently. Figure 4 shows the correlation between the input and the target variables. On the one hand, we observe how an increase in AT produces a decrease in PE, with a minimal vertical spread of scatter points, which indicates a strong inverse relationship between them. This conclusion is supported by a correlation value of $-0.95$ in Figure 3. In fact, there are some studies about GTs [51, 52, 53] which show the effect of AT on the performance of CCPPs. The performance reduction due to an increase in temperature is known to stem from the decrease in the density of inlet air.

On the other hand, we can see how an increase in V also produces a decrease in PE, so there is a strong inverse relationship between them as well. In this case, the spread is slightly larger than for AT, which hints at a slightly weaker relationship. This conclusion is also supported by a correlation value of $-0.87$ in Figure 3. As seen in Figure 1, the CCPP uses a ST, which leads to a considerable increase in total electrical efficiency. When all other variables remain constant, V is known to have a negative impact on condensing-type turbine efficiency [54].

In the case of AP and RH, although PE increases when they increase, Figure 4 depicts a big vertical spread of scatter points, which indicates weak positive relationships that are also confirmed in Figure 3, where $0.52$ and $0.39$ respectively are shown as the correlation values for these variables. AP is also responsible for the density of inlet air, and when all other variables remain constant PE increases with increasing AP [51]. In the case of RH, an increase raises the exhaust-gas temperature of GTs, which leads to an increase in the power generated by the ST [51, 52, 53, 55].

4.2 Prediction Metrics

The quality of a regression model is determined by how well its predictions match the actual values (target values), and we use error metrics to judge this quality. They enable us to compare regressions against other regressions with different parameters. In this work we use several error metrics because each one gives us a complementary insight into the performance of the algorithms.

Mean Absolute Error (MAE)

It is an easily interpretable error metric that does not indicate whether the model undershoots or overshoots the actual data. MAE is the average of the absolute differences between the predicted values and the observed values. A small MAE suggests the model predicts well, while a large MAE suggests that the model may have trouble in certain areas. A MAE of $0$ means that the model is a perfect predictor of the outputs. MAE is defined as:

$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|y_{t}-\hat{y_{t}}\right|,$$

where $n$ is the number of evaluated instances.

Root Mean Square Error (RMSE)

It represents the sample standard deviation of the differences between predicted values and observed values (called residuals). RMSE is defined as:

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-\hat{y_{t}}\right)^{2}}$$

MAE is easy to understand and interpret because it directly averages the absolute offsets, whereas RMSE penalizes large differences more than MAE. However, despite being more complex and biased towards large deviations, RMSE is still the default metric of many models because a loss function defined in terms of RMSE is smoothly differentiable, which makes mathematical operations easier. Researchers often use RMSE because it converts the error metric back into the units of the target variable, making interpretation easier.

Mean Square Error (MSE)

It is just like MAE, but squares the difference before summing them all instead of using the absolute value. We can see this difference in the equation below:

$$\mathrm{MSE}=\frac{1}{n}\sum_{t=1}^{n}\left(y_{t}-\hat{y_{t}}\right)^{2}$$

Because MSE squares the differences, it will almost always be larger than the MAE. Large differences between actual and predicted values are punished more in MSE than in MAE. In the presence of outliers, the use of MAE is preferable, since the outlier residuals will not contribute as much to the total error as in MSE.

R Squared ( $R^{2}$ )

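It is often used for explanatory purposes and measures how well the input variables explain the variability in the target variable. Mathematically, it is given by:

$$R^{2}=1-\frac{\sum_{t=1}^{n}\left(y_{t}-\hat{y_{t}}\right)^{2}}{\sum_{t=1}^{n}\left(y_{t}-\bar{y}\right)^{2}},$$

where $\bar{y}$ is the mean of the observed values.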
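The four metrics of this section can be computed together from the observed and predicted values, as the following plain-Python sketch shows using their standard definitions. The function name `regression_metrics` and the toy values are assumptions for illustration; the experiments rely on the metric implementations of the frameworks described in Section 4.4.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired observed/predicted values."""
    n = len(y_true)
    errors = [observed - predicted for observed, predicted in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # same units as the target
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((observed - mean_y) ** 2 for observed in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # undefined if y_true is constant
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy example: three perfect predictions and one error of 2 units
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

In this toy example the single large error yields an MAE of $0.5$ but an MSE of $1.0$, which illustrates how squared metrics weight large deviations more heavily.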

4.3 Streaming Evaluation Methodology

Evaluation is a fundamental task to know whether an approach outperforms another method only by chance, or whether there is statistical significance to that claim. In the case of SL, the methodology is very specific because it must consider the fact that not all data can be stored in memory (e.g. in online learning only one instance is processed at a time). Data stream regression is usually evaluated in the on-line setting, which is depicted in Figure 7, and where data is not split into training and testing sets. Instead, each model first predicts one instance, which is afterwards used for the construction of the next model. In contrast, in the traditional evaluation for batch processing (see Figures 5 and 6 for non-incremental and incremental types respectively) all data used during training is obtained from the training set.

We have followed this evaluation methodology, proposed in [56, 57, 58], which recommends the following guidelines for streaming evaluation:

Error estimation

We have used an interleaved test-then-train scheme, where each instance is firstly used for testing the model before it is used for training, and from this, the error metric is incrementally updated. The model is thus always being tested on instances it has not yet seen.

Performance evaluation measures

In Section 4.2 we have already detailed the prediction metrics used in this work.

Statistical significance

When comparing regressors, it is necessary to distinguish whether a regressor is better than another one only by chance, or whether the difference is statistically significant. The analysis of variance (ANOVA test [59]) is used to determine whether there are any statistically significant differences between the means of several independent groups. As in [11], in this work it is also used to compare results of machine learning experiments [60]. The idea is to test the null hypothesis (all regressors are equal) against the alternative hypothesis that at least one pair is significantly different. In order to know how different each SR is from the others, we will also perform a multiple pairwise comparison analysis using Tukey’s range test [61].
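The statistic behind the one-way ANOVA test is the ratio between the variability among group means and the variability within groups. The sketch below computes this $F$ statistic in plain Python for groups that would correspond to the per-run scores of each SR; the function name `one_way_anova_f` and the toy groups are hypothetical. Obtaining the p-value additionally requires the $F$ distribution, which statistical libraries such as SciPy provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]
    # Variability of the group means around the grand mean
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variability of the scores around their own group mean
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)
    df_between = k - 1
    df_within = n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Toy scores for three hypothetical regressors, three runs each
f_statistic, df_between, df_within = one_way_anova_f(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
)
```

A large $F$ value relative to the $F$ distribution with the returned degrees of freedom leads to rejecting the null hypothesis that all group means are equal.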

Cost measure

We have opted for measuring the processing time (in seconds) of SRs in each experiment. The computer used in the experiments is based on an x86_64 architecture with $8$ Intel(R) Core(TM) i7 processors at $2.70$ GHz, and $32$ GB of DDR4 memory running at $2,133$ MHz.

4.4 Experiments

We have designed an extensive experimental benchmark in order to find the most suitable SR method for electrical power prediction in CCPPs, by comparing $7$ widely used SRs in terms of error metrics and processing time. The Data and Source Code Availability section at the end of the manuscript provides access to the source code for this experimentation. We have also carried out an ANOVA test to assess the statistical significance of the experiments, and a Tukey’s test to measure the differences between pairs of SRs.

The experimental benchmark has been divided into four different experiments (see Table 2) which have considered two preparatory sizes and two feature selection options, and it is explained in Algorithm 1. The idea is to observe the impact of the number of instances selected for the preparatory phase when the streaming process finalizes, and also to test the relevance of the feature selection process in this streaming scenario. Each experiment has been run $25$ times, and the experimental benchmark has followed the scheme depicted in Figure 2.

The experiments have been carried out under the scikit-multiflow framework [41], which has been implemented in the Python language [62] due to its current popularity in the machine learning community. Inspired by the most popular open source Java framework for data stream mining, Massive Online Analysis (MOA) [58], scikit-multiflow includes a collection of widely used SRs (RHT and RHAT have been selected for this work), among other streaming algorithms (classification, clustering, outlier detection, concept drift detection and recommender systems), datasets, tools, and metrics for SL evaluation. It complements scikit-learn [42], whose primary focus is batch learning (although it also provides researchers with some SL methods: PAR, SGDR and MLPR have been selected for this work), and expands the set of machine learning tools on this platform. The scikit-garden framework in turn complements the experiments by providing the MTR and MFR SRs.

Regarding the feature selection process, in contrast to the study carried out in [11] where different subsets of features were tested manually, we have opted for an automatic process. It is based on the feature importance, which stems from its Pearson correlation [63] with the target variable: if it is higher than a threshold ( $0.65$ ), then it will be considered for the streaming process. As this is a streaming scenario, and thus we do not know the whole dataset beforehand, we have carried out the feature selection process only with the preparatory instances in each of the $25$ runs for Exp1 and Exp3 experiments. After that, we have assumed this selection of features for the rest of the streaming process.
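This automatic selection can be sketched as follows. Since AT and V are negatively correlated with PE (Section 4.1) and are nevertheless the features reported as selected, the sketch assumes that the threshold is applied to the absolute value of the Pearson correlation; this is an interpretation, and the function names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # a constant sequence has no linear relationship
    return covariance / (spread_x * spread_y)


def select_features(rows, targets, names, threshold=0.65):
    """Keep the features whose |correlation| with the target exceeds the threshold.

    rows: preparatory instances (one feature vector per instance).
    """
    selected = []
    for index, name in enumerate(names):
        column = [row[index] for row in rows]
        if abs(pearson(column, targets)) > threshold:  # assumption: absolute value
            selected.append(name)
    return selected


# Toy preparatory instances: first feature strongly (inversely) related
rows = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 1.0], [5.0, 5.0]]
targets = [10.0, 8.0, 6.0, 4.0, 2.0]
selected = select_features(rows, targets, ["AT", "RH"])
```

The selected feature names are then kept fixed for the rest of the stream, as described above.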

Finally, for the hyper-parameter tuning process, we have optimized the parameters by using the randomized, cross-validated search on hyper-parameters provided by scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). In contrast to the other common option, cross-validated grid search, not all parameter values are tried out; rather, a fixed number of parameter settings is sampled from the specified distributions. The resulting parameter settings are quite similar in both cases, while the run time for randomized search is drastically lower. As in the previous case, we have carried out the hyper-parameter tuning process only with the preparatory instances in each of the $25$ runs for all experiments. After that, we have kept the set of tuned parameters for the rest of the streaming process.
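The core idea of the randomized search can be illustrated with the simplified sketch below, which samples a fixed number of settings and keeps the best one. It deliberately omits the cross-validation performed by the scikit-learn utility used in this work; the function `random_search`, the parameter names and the toy scoring function are all hypothetical.

```python
import random


def random_search(score_fn, param_distributions, n_iter=10, seed=0):
    """Sample `n_iter` settings at random and return the best one.

    score_fn: callable mapping a parameter dict to a score (higher is better).
    param_distributions: dict mapping each parameter name to candidate values.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(n_iter):
        # Draw one value per hyper-parameter instead of trying them all
        params = {name: rng.choice(values)
                  for name, values in param_distributions.items()}
        history.append((params, score_fn(params)))
    best_params, best_score = max(history, key=lambda item: item[1])
    return best_params, best_score, history


# Toy search space and score; in practice the score would be a
# validation metric computed on the preparatory instances
space = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "alpha": [0.0001, 0.001, 0.01]}


def toy_score(params):
    return -(params["learning_rate"] - 0.01) ** 2 - (params["alpha"] - 0.001) ** 2


best_params, best_score, history = random_search(toy_score, space, n_iter=5, seed=42)
```

Only `n_iter` settings are evaluated regardless of the size of the search space, which is where the saving in run time with respect to grid search comes from.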

4.5 Results

In this section we present the results of the SRs following the evaluation methodology presented in Section 4.3. Tables 3, 4, 5, and 6 show the error metrics and processing time of each SR for experiments $1$ , $2$ , $3$ and $4$ respectively. Tables 7, 8, 9, and 10 compile the most suitable parameters (hyper-parameters) for the SRs in each experiment. Finally, we present the results of the one-way ANOVA test for each experiment (with $25$ runs) and SR, which is based on the $R^{2}$ error metric. The null hypothesis states that the means of all SRs in the experiment are equal, and the alternative hypothesis is that at least one pair is significantly different. In the next section we will discuss these results and their interpretations.

Next, we will first provide the results of the hyper-parameter tuning process in each experiment. Tables 7, 8, 9, and 10 show the values for the most relevant parameters of each SR. For the details of any parameter, please refer to the frameworks described previously: scikit-learn (PAR, SGDR, and MLPR), scikit-multiflow (RHT and RHAT), and scikit-garden (MTR and MFR). Secondly, we will show in Table 11 the results of the feature selection process in experiments $1$ and $3$ .

Finally, Figure 8 depicts the data distribution of ANOVA tests for each experiment. The p-values obtained from the ANOVA analysis for each experiment ( $p_{1}=5.82\times 10^{-25}$ , $p_{2}=1.78\times 10^{-10}$ , $p_{3}=2.46\times 10^{-42}$ , $p_{4}=1.06\times 10^{-17}$ ) are significant ( $p_{i}<0.05$ ), and therefore we can conclude that there are significant differences among the performances of the SRs. In order to know how different each SR is from the others, we have performed a Tukey’s range test for each experiment (see Appendix A); results suggest that, except in some cases, all pairwise comparisons reject the null hypothesis and indicate statistically significant differences.

5 Discussion

We start the discussion by highlighting the relevance of having a representative set of preparatory instances in a SL process. As introduced in Section 3.2, in streaming scenarios it is not possible to access all historical data, so some strategy is required to make assumptions about the incoming data, unless a drift occurs (in which case an adaptation to the new distribution would be necessary). One of these strategies consists of storing the first instances of the stream (preparatory instances) to carry out a set of preparatory techniques that make the streaming algorithms ready for the streaming process. We have opted for this strategy in our work, and in this section we explain the impact of this preparatory process on the final performance of the SRs.

Preparatory techniques contribute to improving the performance of the SRs. Theoretically, by selecting the subset of features (feature selection) that contributes most to the prediction variable, we avoid irrelevant or partially relevant features that can negatively impact the model performance. By selecting the most suitable parameters of the algorithms (hyper-parameter tuning), we obtain SRs better adjusted to the data. And by training our SRs before the streaming process starts (pre-training), we obtain algorithms ready for the streaming process with better performances. The drawback is that the more instances we collect at the beginning of the process, the more time the preparatory techniques will need. This is a trade-off that has to be considered in each scenario, apart from the limits previously mentioned.

Regarding the number of preparatory instances, as often occurs with machine learning techniques, the more instances are available for training (or other purposes), the better the performance of the SRs can be, because the data distribution is better represented with more data and the SRs are better trained and adjusted to it. On the other hand, the scenario usually poses limits in terms of memory size, computational capacity, or the moment in which the streaming process has to start, among others. Comparing experiments $1$ and $3$ (see Tables 3 and 5, where the feature selection process was carried out and the preparatory instances were $5\%$ and $20\%$ of the dataset respectively) and experiments $2$ and $4$ (see Tables 4 and 6, where the feature selection process was not carried out and the preparatory instances were also $5\%$ and $20\%$ of the dataset respectively), we observe how in almost all cases (except for MTR and MFR when feature selection was carried out) the error metrics improve when the number of preparatory instances is larger. Therefore, by setting aside a group of instances for preparatory purposes, we can generally achieve better results for these stream learners.

In the case of the feature selection process, we deduce from the comparison between Tables 3 and 4 that this preparatory technique improves the performance of RHT and RHAT, and that it also reduces their processing time. For PAR, SGDR, and MLPR, it achieves a similar performance but also reduces their processing time. Thus it is advisable for all of them, except for MTR and MFR, when the preparatory size is $5\%$ . In the case of the comparison between Tables 5 and 6, this preparatory technique improves the performances of PAR and RHAT, and it also reduces or maintains their processing time. For SGDR, MLPR and RHT the performances and the processing times are very similar. Thus it is also advisable for all of them, except again for MTR and MFR, when the preparatory size is $20\%$ . As for which features have been selected for the streaming process in experiments $1$ and $3$ , we see in Table 11 how AT and V have been preferred over the rest by the feature selection method, which is consistent with their correlation with the target variable (PE) shown in Section 4.1.

Regarding the selection of the best SR, Tables 3, 4, 5, and 6 show how MLPR and RHT achieve the best error metrics for both preparatory sizes when the feature selection process is carried out. When there is no feature selection process, the best error metrics are achieved by MFR. However, in terms of processing time, SGDR and MTR are the fastest stream learners. Because we have to find a balance between error metrics and processing time, we recommend RHT.

Finally, the ANOVA and Tukey’s range tests have confirmed the statistical significance of this study (see Appendix A).

6 Conclusion

This work has presented a comparison of streaming regressors for electrical power prediction in a combined cycle power plant. This prediction problem had been tackled with the traditional thermodynamical analysis, which had been shown to be computationally expensive and time-consuming. However, some studies have addressed this problem by applying machine learning techniques, such as regression algorithms, and have managed to reduce the computational and time costs. These approaches have considered the problem from a batch learning perspective in which data is assumed to be at rest, and where regression models do not continuously integrate new information into already constructed models. Our work presents a new approach for this scenario in which data arrives continuously and regression models have to learn incrementally. This approach is closer to the emerging Big Data and IoT paradigms.

The results show that competitive error metrics and processing times have been achieved when applying a SL approach to this specific scenario. Concretely, this work has identified RHT as the most advisable technique for electrical power prediction. We have also highlighted the relevance of the preparatory techniques to make the streaming algorithms ready for the streaming process, and at the same time the importance of properly selecting the number of preparatory instances. Regarding the importance of the features, as in previous works which tackled the same problem from a batch learning perspective, we recommend carrying out a feature selection process for all SRs (except for MTR and MFR) because it reduces the streaming processing time and is worthwhile in terms of performance gain. Finally, as future work, we would like to transfer this SL approach to other processes in combined cycle power plants, and even to other kinds of electrical power plants.

Acknowledgements

This work was supported by the EU project iDev40. This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783163. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Austria, Germany, Belgium, Italy, Spain, Romania. It has also been supported by the Basque Government (Spain) through the project VIRTUAL (KK-2018/00096).

Data and Source Code Availability

Source code and dataset related to this article can be found at:

https://github.com/TxusLopez/Streaming_CCPP. The CCPP dataset has been taken from https://github.com/YungChunLu/UCI-Power-Plant, and originally was used in [11] and taken from the UCI repository at:

https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant.

Appendix A Tukey’s Range Tests

References

[1]

Black, Veatch, Black and veatch strategic directions: Electric report, Tech. rep., Black and Veatch (2018).

URL https://pages.bv.com/SDR-Electric-Download.html

[2]

U. Kesgin, H. Heperkan, Simulation of thermodynamic systems using soft computing techniques, International journal of energy research 29 (7) (2005) 581–611.

[3]

Z.-H. Zhou, N. V. Chawla, Y. Jin, G. J. Williams, Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum], IEEE Computational Intelligence Magazine 9 (4) (2014) 62–74.

[4]

M. Chen, S. Mao, Y. Liu, Big data: A survey, Mobile networks and applications 19 (2) (2014) 171–209.

[5]

P. Domingos, G. Hulten, A general framework for mining massive data streams, Journal of Computational and Graphical Statistics 12 (4) (2003) 945–949.

[6]

J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under concept drift: A review, IEEE Transactions on Knowledge and Data Engineering (2018) 1–1.

[7]

C. Alippi, Intelligence for embedded systems, Springer, 2014.

[8]

I. Žliobaitė, M. Pechenizkiy, J. Gama, An overview of concept drift applications, in: Big Data Analysis: New Algorithms for a New Society, Springer, 2016, pp. 91–114.

[9]

G. De Francisci Morales, A. Bifet, L. Khan, J. Gama, W. Fan, Iot big data stream mining, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 2119–2120.

[10]

J. Manyika, M. Chui, P. Bisson, J. Woetzel, R. Dobbs, J. Bughin, D. Aharon, Unlocking the potential of the internet of things, McKinsey.

[11]

P. Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems 60 (2014) 126–140.

[12]

H. Lasi, P. Fettke, H.-G. Kemper, T. Feld, M. Hoffmann, Industry 4.0, Business & information systems engineering 6 (4) (2014) 239–242.

[13]

H. Kaya, P. Tüfekci, F. S. Gürgen, Local and global learning methods for predicting power of a combined gas & steam turbine, in: Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE, 2012, pp. 13–18.

[14]

M. Rashid, K. Kamal, T. Zafar, Z. Sheikh, A. Shah, S. Mathavan, Energy prediction of a combined cycle power plant using a particle swarm optimization trained feedforward neural network, in: 2015 International Conference on Mechanical Engineering, Automation and Control Systems (MEACS), IEEE, 2015, pp. 1–5.

[15]

J. Kennedy, Particle swarm optimization, Encyclopedia of machine learning (2010) 760–766.

[16]

A. K. Manshad, H. Rostami, S. M. Hosseini, H. Rezaei, Application of artificial neural network–particle swarm optimization algorithm for prediction of gas condensate dew point pressure and comparison with Gaussian processes regression–particle swarm optimization algorithm, Journal of Energy Resources Technology 138 (3) (2016) 032903.

[17]

A. Cavarzere, M. Venturini, Application of forecasting methodologies to predict gas turbine behavior over time, Journal of Engineering for Gas Turbines and Power 134 (1) (2012) 012401.

[18]

R. Sekhon, H. Bassily, J. Wagner, A comparison of two trending strategies for gas turbine performance prediction, Journal of Engineering for Gas Turbines and Power 130 (4) (2008) 041601.

[19]

Y. Li, P. Nilkitsaranont, Gas turbine performance prognostic for condition-based maintenance, Applied Energy 86 (10) (2009) 2152–2161.

[20]

A. G. Memon, R. A. Memon, K. Harijan, M. A. Uqaili, Parametric based thermo-environmental and exergoeconomic analyses of a combined cycle power plant with regression analysis and optimization, Energy conversion and management 92 (2015) 19–35.

[21]

A. G. Memon, R. A. Memon, K. Harijan, M. A. Uqaili, Thermo-environmental analysis of an open cycle gas turbine power plant with regression modeling and optimization, Journal of the Energy Institute 87 (2) (2014) 81–88.

[22]

E. Tsoutsanis, N. Meskin, Derivative-driven window-based regression method for gas turbine performance prognostics, Energy 128 (2017) 302–311.

[23]

E. Tsoutsanis, N. Meskin, M. Benammar, K. Khorasani, A dynamic prognosis scheme for flexible operation of gas turbines, Applied Energy 164 (2016) 686–701.

[24]

V. Losing, B. Hammer, H. Wersing, Incremental on-line learning: A review and comparison of state of the art algorithms, Neurocomputing 275 (2018) 1261–1274.

[25]

I. Khamassi, M. Sayed-Mouchaweh, M. Hammami, K. Ghédira, Discussion and review on evolving data streams and concept drift adapting, Evolving systems 9 (1) (2018) 1–23.

[26]

S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, F. Herrera, A survey on data preprocessing for data stream mining: Current status and future directions, Neurocomputing 239 (2017) 39–57.

[27]

H. M. Gomes, J. P. Barddal, F. Enembreck, A. Bifet, A survey on ensemble learning for data stream classification, ACM Computing Surveys (CSUR) 50 (2) (2017) 23.

[28]

M. Tennant, F. Stahl, O. Rana, J. B. Gomes, Scalable real-time classification of data streams with concept drift, Future Generation Computer Systems 75 (2017) 187–199.

[29]

J. L. Lobo, J. Del Ser, M. N. Bilbao, C. Perfecto, S. Salcedo-Sanz, DRED: An evolutionary diversity generation method for concept drift adaptation in online learning environments, Applied Soft Computing 68 (2018) 693–709.

[30]

J. L. Lobo, I. Laña, J. Del Ser, M. N. Bilbao, N. Kasabov, Evolving spiking neural networks for online learning over drifting data streams, Neural Networks 108 (2018) 1–19.

[31]

P. R. Almeida, L. S. Oliveira, A. S. Britto Jr, R. Sabourin, Adapting dynamic classifier selection for concept drift, Expert Systems with Applications 104 (2018) 67–85.

[32]

R. S. M. de Barros, S. G. T. de Carvalho Santos, An overview and comprehensive comparison of ensembles for concept drift, Information Fusion.

[33]

A. A. Benczúr, L. Kocsis, R. Pálovics, Online Machine Learning in Big Data Streams, arXiv e-prints (2018) arXiv:1802.05872.

[34]

B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: A survey, Information Fusion 37 (2017) 132–156.

[35]

E. Lughofer, M. Pratama, Online active learning in data stream regression using uncertainty sampling based on evolving generalized fuzzy models, IEEE Transactions on fuzzy systems 26 (1) (2017) 292–309.

[36]

E. Ikonomovska, J. Gama, S. Džeroski, Online tree-based ensembles and option trees for regression on evolving data streams, Neurocomputing 150 (2015) 458–470.

[37]

L. Niu, X. Liu, Multivariable generalized predictive scheme for gas turbine control in combined cycle power plant, in: 2008 IEEE Conference on Cybernetics and Intelligent Systems, IEEE, 2008, pp. 791–796.

[38]

V. Ramireddy, An overview of combined cycle power plant, Tech. rep., last accessed 2019-04-25 (August 2012).

URL https://electrical-engineering-portal.com/an-overview-of-combined-cycle-power-plant

[39]

Z. Chen, B. Liu, Lifelong machine learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 10 (3) (2016) 1–145.

[40]

N. R. Draper, H. Smith, Applied regression analysis, Vol. 326, John Wiley & Sons, 2014.

[41]

J. Montiel, J. Read, A. Bifet, T. Abdessalem, Scikit-multiflow: A multi-output streaming framework, Journal of Machine Learning Research 19 (72) (2018) 1–5.

[42]

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, Journal of machine learning research 12 (Oct) (2011) 2825–2830.

[43]

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms, Journal of Machine Learning Research 7 (Mar) (2006) 551–585.

[44]

L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.

[45]

T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in: Proceedings of the twenty-first international conference on Machine learning, ACM, 2004, p. 116.

[46]

D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., Learning representations by back-propagating errors, Cognitive modeling 5 (3) (1988) 1.

[47]

P. Domingos, G. Hulten, Mining high-speed data streams, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2000, pp. 71–80.

[48]

E. Ikonomovska, J. Gama, S. Džeroski, Learning model trees from evolving data streams, Data mining and knowledge discovery 23 (1) (2011) 128–168.

[49]

A. Bifet, R. Gavalda, Learning from time-changing data with adaptive windowing, in: Proceedings of the 2007 SIAM international conference on data mining, SIAM, 2007, pp. 443–448.

[50]

B. Lakshminarayanan, D. M. Roy, Y. W. Teh, Mondrian forests: Efficient online random forests, in: Advances in neural information processing systems, 2014, pp. 3140–3148.

[51]

F. R. P. Arrieta, E. E. S. Lora, Influence of ambient temperature on combined-cycle power-plant performance, Applied Energy 80 (3) (2005) 261–272.

[52]

A. De Sa, S. Al Zubaidy, Gas turbine performance at varying ambient temperature, Applied Thermal Engineering 31 (14-15) (2011) 2735–2739.

[53]

H. H. Erdem, S. H. Sevilgen, Case study: Effect of ambient temperature on the electricity production and fuel consumption of a simple cycle gas turbine in Turkey, Applied Thermal Engineering 26 (2-3) (2006) 320–326.

[54]

M. Patel, N. Nath, Improve steam turbine efficiency, Hydrocarbon Processing 79 (6) (2000) 85–90.

[55]

J. J. Lee, T. S. Kim, et al., Development of a gas turbine performance analysis program and its application, Energy 36 (8) (2011) 5274–5285.

[56]

J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM computing surveys (CSUR) 46 (4) (2014) 44.

[57]

A. Bifet, G. De Francisci Morales, J. Read, G. Holmes, B. Pfahringer, Efficient online evaluation of big data stream classifiers, in: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2015, pp. 59–68.

[58]

A. Bifet, R. Gavaldà, G. Holmes, B. Pfahringer, Machine Learning for Data Streams with Practical Examples in MOA, MIT Press, 2018, https://moa.cms.waikato.ac.nz/book/.

[59]

H. Scheffe, The analysis of variance, Vol. 72, John Wiley & Sons, 1999.

[60]

E. Alpaydin, Introduction to machine learning, MIT press, 2009.

[61]

J. W. Tukey, et al., Comparing individual means in the analysis of variance, Biometrics 5 (2) (1949) 99–114.

[62]

T. E. Oliphant, Python for scientific computing, Computing in Science & Engineering 9 (3) (2007) 10–20.

[63]

J. Benesty, J. Chen, Y. Huang, I. Cohen, Pearson correlation coefficient, in: Noise reduction in speech processing, Springer, 2009, pp. 1–4.

