Backward-Forward Algorithm: An Improvement towards Extreme Learning   Machine

Dibyasundar Das; Deepak Ranjan Nayak; Ratnakar Dash; Banshidhar Majhi

arXiv:1907.10282·cs.LG·October 8, 2019

TL;DR

This paper introduces a backward-forward algorithm that improves extreme learning machines by reducing the number of hidden nodes needed and decreasing training iterations through a Moore-Penrose approximation-based supervised learning method.

Contribution

It presents a novel supervised learning approach using Moore-Penrose approximation to optimize input and output weights in fewer epochs, outperforming traditional extreme learning machines.

Findings

1. Requires fewer hidden nodes for generalization.

2. Reduces training iterations compared to back-propagation.

3. Outperforms existing extreme learning machine methods.

Abstract

The extreme learning machine needs a large number of hidden nodes to generalize a single hidden layer neural network for a given training data-set. The need for a larger number of hidden nodes suggests that the neural network is memorizing the training samples rather than generalizing. Hence, a supervised learning method is described here that uses the Moore-Penrose approximation to determine both the input-weight and the output-weight in two epochs, namely, the backward-pass and the forward-pass. The proposed technique has an advantage over the back-propagation method in terms of the iterations required and is superior to the extreme learning machine in terms of the number of hidden units necessary for generalization.

Tables

Table 1. TABLE I: List of symbols

Symbol	Meaning
$N$	Number of samples
$P$	Size of input nodes
$M$	Size of hidden nodes
$C$	Size of output nodes
$x_{j}$	Input vector ${[x_{(j, 1)}, x_{(j, 2)}, \dots, x_{(j, P)}]}^{T}$ where, $j = 1, 2, \dots, N$
$I$	Augmented input data-set
	$I = [\begin{matrix} x_{(1, 1)} & x_{(1, 2)} & \dots & x_{(1, P)} & 1 \\ x_{(2, 1)} & x_{(2, 2)} & \dots & x_{(2, P)} & 1 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ x_{(N, 1)} & x_{(N, 2)} & \dots & x_{(N, P)} & 1 \end{matrix}]$
$t_{j}$	Output vector ${[t_{(j, 1)}, t_{(j, 2)}, \dots, t_{(j, C)}]}^{T}$ where, $j = 1, 2, \dots, N$
$w_{i}$	Input weight $[w_{(i, 1)}, w_{(i, 2)}, \dots, w_{(i, P)}]$ where, $i = 1, 2, \dots, M$
$b_{i}$	Input bias where, $i = 1, 2, \dots, M$
$W$	Input weight
	$W = {[\begin{matrix} w_{(1, 1)} & w_{(1, 2)} & \dots & w_{(1, P)} & b_{1} \\ w_{(2, 1)} & w_{(2, 2)} & \dots & w_{(2, P)} & b_{2} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ w_{(M, 1)} & w_{(M, 2)} & \dots & w_{(M, P)} & b_{M} \end{matrix}]}^{T}$
$β_{k}$	Output weight $[β_{(k, 1)}, β_{(k, 2)}, \dots, β_{(k, M)}]$ where, $k = 1, 2, \dots, C$
$g (.)$	Activation function
${()}^{†}$	Pseudoinverse
$\mathrm{ortho}(.)$	Orthogonal transformation

Table 2. TABLE II: Weight initialization scheme investigated in this work

Name	Description
Uniform random initialization	$W \sim U [l, u]$ , where, $l$ represents lower range and $u$ represents upper range of the uniform distribution $U$
Xavier initialization	$W \sim N (0, \frac{2}{n_{in} + n_{out}})$ where $n_{in}$ and $n_{out}$ represent the input layer size (dimension of features) and the output layer size (number of classes) respectively.
ReLU initialization	$W \sim N (0, \sqrt{\frac{2}{n_{c}}})$ where, $n_{c}$ is hidden nodes size
Orthogonal initialization	Random orthogonal matrix each row with orthogonal vector
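
The four schemes in Table II can be sketched in NumPy as follows. This is an illustrative reading of the table, not the authors' code: the function name `init_weights`, its arguments, and the QR-based construction of the orthogonal matrix are assumptions, and the second parameter of each normal distribution is interpreted as a variance for Xavier and as a standard deviation for ReLU initialization, following the way the table writes them.

```python
import numpy as np

def init_weights(scheme, n_in, n_hidden, n_out, rng, low=-1.0, high=1.0):
    """Return an (n_in, n_hidden) input-weight matrix (hypothetical helper)."""
    shape = (n_in, n_hidden)
    if scheme == "uniform":
        # W ~ U[low, high]
        return rng.uniform(low, high, size=shape)
    if scheme == "xavier":
        # Assumption: 2 / (n_in + n_out) is read as the variance.
        std = np.sqrt(2.0 / (n_in + n_out))
        return rng.normal(0.0, std, size=shape)
    if scheme == "relu":
        # Assumption: sqrt(2 / n_c) is read as the standard deviation,
        # with n_c the number of hidden nodes.
        std = np.sqrt(2.0 / n_hidden)
        return rng.normal(0.0, std, size=shape)
    if scheme == "ortho":
        # Assumption: a random orthogonal matrix is obtained from the QR
        # decomposition of a square Gaussian matrix, then cropped.
        size = max(shape)
        q, _ = np.linalg.qr(rng.normal(size=(size, size)))
        return q[:n_in, :n_hidden]
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, `init_weights("uniform", 4, 6, 3, np.random.default_rng(0), low=0.0, high=1.0)` corresponds to the `rand(0,1)` rows of the result tables, under the assumptions above.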

Table 3. TABLE III: Activation functions investigated in this work

Activation function	Expression
Linear	$g (x) = x$
Sigmoid	$g (x) = \frac{1}{1 + e^{- x}}$
ReLu	$g (x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$
Tanh	$g (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}$
Softsign	$g (x) = \frac{1}{1 + \| x \|}$
Sin	$g (x) = \sin (x)$
Cos	$g (x) = \cos (x)$
Sinc	$g (x) = \begin{cases} 1 & \text{if } x = 0 \\ \frac{\sin (x)}{x} & \text{if } x \neq 0 \end{cases}$
LeakyReLu	$g (x) = \begin{cases} x & \text{if } x > 0 \\ 0.001 \cdot x & \text{if } x \leq 0 \end{cases}$
Gaussian	$g (x) = e^{- x^{2}}$
Bent Identity	$g (x) = \frac{\sqrt{x^{2} + 1} - 1}{2} + x$
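
Most rows of Table III translate directly into vectorized NumPy expressions. The dictionary below is a minimal sketch; the name `ACTIVATIONS` and its keys are chosen here for illustration only, and the sinc entry relies on the fact that `np.sinc(x)` computes $\sin(\pi x)/(\pi x)$, so the argument is rescaled by $\pi$.

```python
import numpy as np

# Hypothetical lookup table of element-wise activation functions.
ACTIVATIONS = {
    "linear": lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu": lambda x: np.maximum(x, 0.0),
    "tanh": np.tanh,
    "sin": np.sin,
    "cos": np.cos,
    # np.sinc is the normalized sinc, so divide by pi to get sin(x)/x.
    "sinc": lambda x: np.sinc(x / np.pi),
    "leaky_relu": lambda x: np.where(x > 0, x, 0.001 * x),
    "gaussian": lambda x: np.exp(-x ** 2),
    "bent_identity": lambda x: (np.sqrt(x ** 2 + 1.0) - 1.0) / 2.0 + x,
}
```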

Table 4. TABLE IV: RMSE comparison on sinc regression for 10 hidden nodes

Activation function	Weight initialization	ELM RMSE	ELM Test Time	BF-ELM RMSE	BF-ELM Test Time
Relu	ortho	0.18	0.000265	0.14	0.000245
	rand(0,1)	0.21	0.000264	0.20	0.000252
	rand(-1,1)	0.10	0.000260	0.16	0.000215
	xavier	0.21	0.000277	0.21	0.000232
	relu	0.09	0.000270	0.22	0.000304
Sigmoid	ortho	0.02	0.000305	0.03	0.000297
	rand(0,1)	0.03	0.000298	0.04	0.000296
	rand(-1,1)	0.02	0.000304	0.03	0.000293
	xavier	0.05	0.000315	0.06	0.000294
	relu	0.01	0.000296	0.02	0.000297
Tanh	ortho	0.03	0.000404	0.04	0.000392
	rand(0,1)	0.06	0.000407	0.07	0.000392
	rand(-1,1)	0.06	0.000410	0.03	0.000401
	xavier	0.06	0.000410	0.06	0.000402
	relu	0.01	0.000395	0.06	0.000395
Softsign	ortho	0.09	0.000153	0.13	0.000148
	rand(0,1)	0.12	0.000148	0.13	0.000138
	rand(-1,1)	0.11	0.000145	0.11	0.000140
	xavier	0.07	0.000144	0.06	0.000131
	relu	0.08	0.000144	0.13	0.000137
Sin	ortho	0.03	0.000292	0.12	0.000234
	rand(0,1)	0.01	0.000289	0.10	0.000238
	rand(-1,1)	0.01	0.000307	0.03	0.000232
	xavier	0.12	0.000291	0.04	0.000192
	relu	0.04	0.000176	0.04	0.000184
Cos	ortho	0.03	0.000213	0.03	0.000201
	rand(0,1)	0.01	0.000236	0.03	0.000199
	rand(-1,1)	0.01	0.000240	0.10	0.000182
	xavier	0.24	0.000257	0.09	0.000179
	relu	0.02	0.000243	0.02	0.000197
Sinc	ortho	0.08	0.000492	0.01	0.000444
	rand(0,1)	0.01	0.000565	0.03	0.000655
	rand(-1,1)	0.01	0.000702	0.03	0.000594
	xavier	0.02	0.000661	0.08	0.000606
	relu	0.01	0.000687	0.08	0.000496
BentIde	ortho	0.03	0.000169	0.03	0.000174
	rand(0,1)	0.08	0.000170	0.04	0.000166
	rand(-1,1)	0.03	0.000183	0.07	0.000166
	xavier	0.08	0.000171	0.03	0.000168
	relu	0.02	0.000178	0.03	0.000168
ArcTan	ortho	0.06	0.000212	0.11	0.000191
	rand(0,1)	0.06	0.000237	0.05	0.000208
	rand(-1,1)	0.08	0.000235	0.08	0.000189
	xavier	0.02	0.000220	0.06	0.000205
	relu	0.02	0.000198	0.03	0.000204

Table 5. TABLE V: Accuracy comparison on iris data-set for 6 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	77.78	0.000038	77.78	0.000121
	rand(0,1)	77.78	0.000037	77.78	0.000087
	rand(-1,1)	77.78	0.000040	77.78	0.000121
	xavier	77.78	0.000036	80.00	0.000123
	relu	77.78	0.000029	77.78	0.000068
Relu	ortho	6.67	0.000016	80.00	0.000062
	rand(0,1)	77.78	0.000020	75.56	0.000062
	rand(-1,1)	82.22	0.000013	80.00	0.000060
	xavier	6.67	0.000022	86.67	0.000089
	relu	97.78	0.000014	95.56	0.000063
Sigmoid	ortho	88.89	0.000030	100.00	0.000143
	rand(0,1)	77.78	0.000027	97.78	0.000097
	rand(-1,1)	100.00	0.000032	100.00	0.000123
	xavier	91.11	0.000025	100.00	0.000157
	relu	88.89	0.000030	100.00	0.000105
Tanh	ortho	97.78	0.000031	97.78	0.000121
	rand(0,1)	71.11	0.000038	100.00	0.000110
	rand(-1,1)	82.22	0.000036	100.00	0.000111
	xavier	100.00	0.000040	95.56	0.000114
	relu	80.00	0.000039	97.78	0.000110
Softsign	ortho	77.78	0.000022	93.33	0.000090
	rand(0,1)	82.22	0.000022	93.33	0.000090
	rand(-1,1)	100.00	0.000020	100.00	0.000090
	xavier	97.78	0.000020	97.78	0.000090
	relu	88.89	0.000021	95.56	0.000087
Sin	ortho	93.33	0.000036	100.00	0.000108
	rand(0,1)	88.89	0.000037	100.00	0.000114
	rand(-1,1)	97.78	0.000032	97.78	0.000120
	xavier	100.00	0.000355	93.33	0.000153
	relu	100.00	0.000202	100.00	0.000111
Cos	ortho	97.78	0.000058	95.56	0.000120
	rand(0,1)	88.89	0.000040	95.56	0.000109
	rand(-1,1)	91.11	0.000036	100.00	0.000111
	xavier	95.56	0.000038	100.00	0.000105
	relu	86.67	0.000037	84.44	0.000113
LeakyRelu	ortho	93.33	0.000036	80.00	0.000111
	rand(0,1)	77.78	0.000031	100.00	0.000101
	rand(-1,1)	75.56	0.000031	100.00	0.000104
	xavier	88.89	0.000033	100.00	0.000113
	relu	100.00	0.000042	82.22	0.000103
BentIde	ortho	80.00	0.000031	95.56	0.000101
	rand(0,1)	86.67	0.000032	97.78	0.000106
	rand(-1,1)	97.78	0.000033	95.56	0.000106
	xavier	100.00	0.000034	100.00	0.000117
	relu	93.33	0.000033	97.78	0.000132
ArcTan	ortho	97.78	0.000041	97.78	0.000104
	rand(0,1)	80.00	0.000036	100.00	0.000113
	rand(-1,1)	100.00	0.000038	100.00	0.000109
	xavier	84.44	0.000038	100.00	0.000121
	relu	100.00	0.000043	97.78	0.000126

Table 6. TABLE VI: Accuracy comparison on Sat-Image test data-set with ELM for 20 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	63.95	0.000407	74.55	0.000730
	rand(0,1)	62.85	0.000370	74.45	0.002063
	rand(-1,1)	65.45	0.000438	74.60	0.000654
	xavier	62.60	0.000372	74.50	0.001581
	relu	63.45	0.000391	74.50	0.000646
Relu	ortho	66.95	0.000572	80.55	0.002736
	rand(0,1)	64.80	0.000442	80.60	0.000827
	rand(-1,1)	68.55	0.000502	75.35	0.000970
	xavier	68.25	0.000500	79.65	0.000757
	relu	63.60	0.000513	81.15	0.002697
Sigmoid	ortho	53.95	0.000852	83.05	0.001171
	rand(0,1)	23.05	0.000541	82.55	0.001146
	rand(-1,1)	47.95	0.000844	80.35	0.000991
	xavier	73.30	0.000955	82.20	0.001273
	relu	63.85	0.002882	82.00	0.001010
Tanh	ortho	65.25	0.001031	82.85	0.001255
	rand(0,1)	23.05	0.000602	81.10	0.001398
	rand(-1,1)	34.30	0.000600	80.15	0.001204
	xavier	62.35	0.000898	81.55	0.001265
	relu	39.00	0.000629	80.90	0.001152
Softsign	ortho	68.25	0.001630	81.25	0.000740
	rand(0,1)	75.20	0.001567	82.15	0.000693
	rand(-1,1)	75.35	0.001555	81.60	0.000861
	xavier	78.95	0.000555	81.20	0.000992
	relu	78.20	0.000552	81.30	0.002090
Sin	ortho	18.05	0.000883	81.10	0.001281
	rand(0,1)	18.30	0.002236	79.25	0.000976
	rand(-1,1)	19.35	0.001087	80.60	0.001153
	xavier	76.25	0.002281	82.35	0.000971
	relu	24.30	0.001090	79.35	0.001147
Cos	ortho	17.70	0.001055	80.65	0.001890
	rand(0,1)	17.10	0.000940	79.05	0.002210
	rand(-1,1)	17.40	0.001169	79.90	0.001183
	xavier	69.20	0.000995	80.10	0.002716
	relu	27.90	0.000981	81.60	0.000988
LeakyRelu	ortho	72.20	0.000783	82.05	0.002142
	rand(0,1)	63.10	0.001910	79.90	0.000957
	rand(-1,1)	67.75	0.003062	77.40	0.001459
	xavier	71.40	0.002777	80.40	0.001300
	relu	69.10	0.002732	80.35	0.001703
BentIde	ortho	69.70	0.002916	82.00	0.000753
	rand(0,1)	64.30	0.001475	81.85	0.000944
	rand(-1,1)	65.60	0.000713	82.25	0.002872
	xavier	71.80	0.000621	82.50	0.000766
	relu	62.85	0.000561	81.80	0.000807
ArcTan	ortho	70.65	0.000779	82.10	0.000992
	rand(0,1)	77.65	0.000746	83.35	0.001195
	rand(-1,1)	76.75	0.000902	81.20	0.001165
	xavier	78.60	0.000882	81.00	0.000985
	relu	76.55	0.000851	79.75	0.001149

Table 7. TABLE VII: Accuracy comparison on Shuttle test data-set with ELM for 30 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	87.94	0.001660	87.94	0.005870
	rand(0,1)	87.94	0.001673	87.94	0.005970
	rand(-1,1)	87.94	0.003528	87.94	0.002358
	xavier	87.94	0.003402	87.94	0.006252
	relu	87.94	0.003474	87.94	0.005913
Relu	ortho	92.88	0.008845	91.86	0.005580
	rand(0,1)	88.56	0.002351	95.92	0.005892
	rand(-1,1)	92.63	0.007760	94.24	0.007834
	xavier	95.93	0.003248	96.61	0.009495
	relu	92.71	0.006785	91.88	0.003122
Sigmoid	ortho	95.83	0.009473	94.70	0.005198
	rand(0,1)	79.15	0.003522	95.99	0.006812
	rand(-1,1)	92.95	0.004019	98.66	0.004478
	xavier	95.92	0.004249	96.54	0.014445
	relu	93.32	0.015859	95.94	0.014747
Tanh	ortho	92.02	0.004684	96.08	0.004964
	rand(0,1)	79.16	0.005102	95.92	0.004874
	rand(-1,1)	90.71	0.009397	96.11	0.004620
	xavier	89.91	0.005345	97.00	0.018385
	relu	92.28	0.005576	95.39	0.017629
Softsign	ortho	91.70	0.001886	98.60	0.006648
	rand(0,1)	79.11	0.001870	98.32	0.007419
	rand(-1,1)	89.14	0.001743	95.81	0.006926
	xavier	92.21	0.004412	94.92	0.006698
	relu	91.47	0.001895	95.90	0.002500
Sin	ortho	74.03	0.004879	94.22	0.004050
	rand(0,1)	38.12	0.004816	95.30	0.004491
	rand(-1,1)	40.59	0.005819	94.01	0.004349
	xavier	76.89	0.005369	93.89	0.004550
	relu	94.78	0.005162	94.88	0.004606
Cos	ortho	67.63	0.004991	95.33	0.004404
	rand(0,1)	33.88	0.017163	94.93	0.004892
	rand(-1,1)	36.73	0.004962	94.49	0.004495
	xavier	89.12	0.004869	94.78	0.008545
	relu	96.68	0.005868	94.46	0.005029
LeakyRelu	ortho	92.99	0.014102	92.19	0.005792
	rand(0,1)	88.72	0.008438	92.94	0.004327
	rand(-1,1)	93.44	0.004335	92.32	0.013124
	xavier	93.92	0.004438	92.43	0.012369
	relu	90.21	0.012439	93.19	0.013407
BentIde	ortho	92.99	0.006876	95.72	0.009425
	rand(0,1)	88.27	0.007857	96.29	0.004027
	rand(-1,1)	92.90	0.007002	96.93	0.003831
	xavier	94.16	0.007719	97.70	0.008557
	relu	94.99	0.007748	97.35	0.003863
ArcTan	ortho	91.48	0.003868	98.60	0.007082
	rand(0,1)	79.13	0.003311	97.66	0.012485
	rand(-1,1)	88.91	0.003581	97.97	0.005168
	xavier	91.81	0.006125	96.91	0.013778
	relu	92.80	0.003609	96.83	0.005014

Table 8. TABLE VIII: Accuracy comparison on Forest cover data-set with ELM for 200 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	54.28	0.499435	54.28	0.622756
	rand(0,1)	54.27	0.500526	54.28	0.511839
	rand(-1,1)	54.28	0.471568	54.28	0.495267
	xavier	54.27	0.470973	54.28	0.498806
	relu	54.27	0.472855	54.28	0.502121
Relu	ortho	57.64	0.990403	57.16	0.842121
	rand(0,1)	54.28	0.598057	56.99	0.830652
	rand(-1,1)	57.52	0.921646	57.09	0.868510
	xavier	57.26	0.923671	57.83	0.811645
	relu	57.73	0.921243	56.81	0.835724
Sigmoid	ortho	58.49	0.754831	59.77	0.857107
	rand(0,1)	58.67	0.987742	58.97	0.836054
	rand(-1,1)	58.88	0.935462	59.43	0.840899
	xavier	58.03	0.757011	59.32	0.834694
	relu	58.96	0.742144	59.58	0.864178
Tanh	ortho	59.00	1.021268	59.47	1.106261
	rand(0,1)	56.28	1.303198	59.45	1.096017
	rand(-1,1)	57.97	1.240969	58.89	1.126781
	xavier	57.88	0.963619	58.88	1.113423
	relu	58.30	0.957834	59.77	1.101439
Softsign	ortho	57.95	0.579836	60.13	0.598289
	rand(0,1)	58.51	0.561542	59.19	0.590835
	rand(-1,1)	58.43	0.563180	59.88	0.588342
	xavier	58.38	0.564881	59.33	0.587214
	relu	58.48	0.562915	59.09	0.589696
Sin	ortho	58.42	0.620325	59.33	0.645977
	rand(0,1)	57.94	1.210180	59.01	0.646018
	rand(-1,1)	58.81	0.798120	59.01	0.636448
	xavier	58.52	0.613438	59.26	0.647153
	relu	58.14	0.618676	59.66	0.652874
Cos	ortho	58.34	0.648218	59.14	0.691882
	rand(0,1)	58.54	1.240182	59.82	0.695169
	rand(-1,1)	58.82	0.817066	59.39	0.686721
	xavier	59.15	0.678123	59.25	0.701660
	relu	58.93	0.645320	59.46	0.709892
LeakyRelu	ortho	58.80	1.772426	56.23	1.278287
	rand(0,1)	54.27	0.727510	56.33	1.349439
	rand(-1,1)	58.37	1.620844	57.27	1.290102
	xavier	57.90	1.603439	56.90	1.268446
	relu	58.07	1.518761	57.84	1.304729
BentIde	ortho	58.56	1.052638	59.65	1.071917
	rand(0,1)	59.08	1.055782	59.64	1.076950
	rand(-1,1)	58.96	1.049004	59.77	1.070770
	xavier	58.56	1.049435	58.81	1.073052
	relu	59.00	1.052567	59.30	1.079879
ArcTan	ortho	58.16	0.711534	60.49	0.771294
	rand(0,1)	58.87	0.912735	58.98	0.926173
	rand(-1,1)	58.19	0.823203	58.68	0.786256
	xavier	58.22	0.692458	59.00	0.845435
	relu	58.44	0.673110	59.68	0.750680

Table 9. TABLE IX: Accuracy comparison on MNIST test data set with ELM for 20 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	64.32	0.068089	86.02	0.069997
	rand(0,1)	64.41	0.069332	85.97	0.068331
	rand(-1,1)	61.37	0.067602	86.03	0.068694
	xavier	60.07	0.067313	86.03	0.068067
	relu	64.18	0.068180	85.99	0.072153
Relu	ortho	54.71	0.071155	86.35	0.070987
	rand(0,1)	64.90	0.068160	85.85	0.068724
	rand(-1,1)	51.86	0.069475	85.73	0.069322
	xavier	54.15	0.068726	85.52	0.068729
	relu	55.00	0.068493	86.12	0.070499
Sigmoid	ortho	64.35	0.068715	86.88	0.069570
	rand(0,1)	11.33	0.069876	87.29	0.069305
	rand(-1,1)	54.65	0.068989	86.65	0.069786
	xavier	65.55	0.068768	87.56	0.069762
	relu	63.36	0.069838	86.86	0.069551
Tanh	ortho	64.07	0.070599	86.61	0.070682
	rand(0,1)	11.35	0.069093	86.72	0.070277
	rand(-1,1)	53.89	0.070037	86.94	0.070810
	xavier	63.46	0.070281	86.95	0.070624
	relu	58.14	0.070568	86.22	0.072531
Softsign	ortho	62.51	0.068898	86.97	0.069097
	rand(0,1)	62.58	0.068350	86.48	0.069852
	rand(-1,1)	52.77	0.069188	87.10	0.069221
	xavier	64.50	0.067966	87.14	0.068566
	relu	63.53	0.068463	86.13	0.069073
Sin	ortho	64.68	0.068719	86.98	0.069508
	rand(0,1)	11.05	0.071195	86.60	0.070107
	rand(-1,1)	13.67	0.072125	86.81	0.069073
	xavier	61.21	0.069393	86.81	0.069310
	relu	58.62	0.069062	86.87	0.071229
Cos	ortho	45.94	0.071003	86.24	0.068565
	rand(0,1)	11.10	0.070971	86.30	0.069599
	rand(-1,1)	13.20	0.070877	86.46	0.069696
	xavier	49.49	0.068486	86.72	0.070056
	relu	50.85	0.068768	86.67	0.069858
LeakyRelu	ortho	55.33	0.070158	86.29	0.069612
	rand(0,1)	65.24	0.068556	86.58	0.070003
	rand(-1,1)	50.45	0.069813	85.88	0.071857
	xavier	50.64	0.070322	85.97	0.070441
	relu	55.01	0.070688	86.94	0.071169
BentIde	ortho	61.97	0.068751	86.42	0.070512
	rand(0,1)	63.16	0.068898	86.23	0.069749
	rand(-1,1)	60.36	0.069847	87.07	0.071223
	xavier	62.28	0.068689	86.93	0.070224
	relu	60.99	0.069668	86.44	0.069452
ArcTan	ortho	61.83	0.068867	86.76	0.070357
	rand(0,1)	60.23	0.069516	87.07	0.070682
	rand(-1,1)	56.91	0.070821	86.91	0.069662
	xavier	64.44	0.069059	86.92	0.069933
	relu	63.51	0.069914	86.93	0.071118

Table 10. TABLE X: Accuracy comparison on Brain MRI data set with ELM for 10 hidden nodes

Activation function	Weight initialization	ELM Acc	ELM Test Time	BF-ELM Acc	BF-ELM Test Time
None	ortho	57.50	0.001152	100.00	0.001372
	rand(0,1)	52.50	0.000811	100.00	0.000814
	rand(-1,1)	50.00	0.000755	100.00	0.000893
	xavier	37.50	0.000803	100.00	0.000791
	relu	47.50	0.000705	100.00	0.000881
Relu	ortho	37.50	0.000779	100.00	0.000885
	rand(0,1)	45.00	0.000841	95.00	0.000909
	rand(-1,1)	57.50	0.000775	90.00	0.000812
	xavier	32.50	0.000770	90.00	0.000779
	relu	47.50	0.000743	87.50	0.000767
Sigmoid	ortho	65.00	0.000716	100.00	0.000926
	rand(0,1)	20.00	0.000821	100.00	0.000862
	rand(-1,1)	42.50	0.000775	92.50	0.000869
	xavier	35.00	0.000722	92.50	0.000854
	relu	55.00	0.000796	95.00	0.000892
Tanh	ortho	50.00	0.000745	97.50	0.000939
	rand(0,1)	20.00	0.000805	77.50	0.000856
	rand(-1,1)	27.50	0.000724	92.50	0.000985
	xavier	50.00	0.000832	87.50	0.000811
	relu	37.50	0.000718	57.50	0.000854
Softsign	ortho	50.00	0.000802	95.00	0.000929
	rand(0,1)	55.00	0.001009	92.50	0.000770
	rand(-1,1)	37.50	0.000752	87.50	0.000770
	xavier	47.50	0.000732	95.00	0.000893
	relu	37.50	0.000664	95.00	0.000770
Sin	ortho	17.50	0.000701	100.00	0.000811
	rand(0,1)	20.00	0.000872	95.00	0.000808
	rand(-1,1)	35.00	0.000785	100.00	0.000811
	xavier	45.00	0.000770	95.00	0.000863
	relu	32.50	0.000948	97.50	0.000798
Cos	ortho	52.50	0.000690	100.00	0.000817
	rand(0,1)	15.00	0.000855	97.50	0.000836
	rand(-1,1)	27.50	0.000719	95.00	0.000837
	xavier	62.50	0.000805	92.50	0.000837
	relu	40.00	0.000785	95.00	0.000817
LeakyRelu	ortho	57.50	0.000691	95.00	0.000797
	rand(0,1)	42.50	0.000697	92.50	0.000799
	rand(-1,1)	30.00	0.000692	100.00	0.000799
	xavier	37.50	0.000694	97.50	0.000831
	relu	47.50	0.000901	95.00	0.000807
BentIde	ortho	37.50	0.000722	100.00	0.000786
	rand(0,1)	52.50	0.000836	90.00	0.000803
	rand(-1,1)	57.50	0.000689	100.00	0.001013
	xavier	50.00	0.000763	100.00	0.000786
	relu	52.50	0.001027	90.00	0.000812
ArcTan	ortho	52.50	0.000728	100.00	0.000816
	rand(0,1)	50.00	0.000794	100.00	0.000821
	rand(-1,1)	50.00	0.000706	100.00	0.000812
	xavier	55.00	0.000723	97.50	0.000801
	relu	42.50	0.000760	97.50	0.000797

Equations

o_{j} = \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}), \quad j = 1, \dots, N

E = \sum_{j=1}^{N} \left\| \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) - t_{j} \right\|

\sum_{j=1}^{N} \left\| \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) - t_{j} \right\| = 0 \;\Rightarrow\; \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) = t_{j} \quad \text{for all } j = 1, \dots, N

H \beta = T

H = \begin{bmatrix} g(w_{1} \cdot x_{1} + b_{1}) & \cdots & g(w_{M} \cdot x_{1} + b_{M}) \\ \vdots & \ddots & \vdots \\ g(w_{1} \cdot x_{N} + b_{1}) & \cdots & g(w_{M} \cdot x_{N} + b_{M}) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_{1} \\ \vdots \\ \beta_{M} \end{bmatrix}, \quad T = \begin{bmatrix} t_{1} \\ \vdots \\ t_{N} \end{bmatrix}

\| \hat{H} \hat{\beta} - T \| \simeq \| H \beta - T \|

\hat{H} = \begin{bmatrix} \hat{g}(\hat{w}_{1} \cdot x_{1} + \hat{b}_{1}) & \cdots & \hat{g}(\hat{w}_{M} \cdot x_{1} + \hat{b}_{M}) \\ \vdots & \ddots & \vdots \\ \hat{g}(\hat{w}_{1} \cdot x_{N} + \hat{b}_{1}) & \cdots & \hat{g}(\hat{w}_{M} \cdot x_{N} + \hat{b}_{M}) \end{bmatrix}, \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_{1} \\ \vdots \\ \hat{\beta}_{M} \end{bmatrix}

\hat{\beta} = \hat{H}^{\dagger} T = (\hat{H}^{\prime} \hat{H})^{-1} \hat{H}^{\prime} T

H = T \times \beta^{\dagger} + \text{Random Error}

W = I^{\dagger} \times H

W = [W, \mathrm{orth}(W)]

H = g(I \times W)

\beta = H^{\dagger} \times T

y(x) = \begin{cases} \sin(x)/x & x \neq 0 \\ 1 & x = 0 \end{cases}

Taxonomy

Topics: Machine Learning and ELM · Neural Networks and Applications · Advanced Memory and Neural Computing

Full text

Dibyasundar Das, Deepak Ranjan Nayak, Ratnakar Dash, and Banshidhar Majhi are with the Department of Computer Science and Engineering, National Institute of Technology Rourkela, Odisha, India, 769008 (e-mail: [email protected]).

github link: https://github.com/Dibyasundar/BackwardForwardELM

Index Terms:

Extreme Learning Machine, Single Layer Feed-forward Network, Image classification.

I Introduction

Machine learning is a key element of many real-world applications such as text recognition [1, 2, 3], speech recognition [4, 5], automated CAD systems [6, 7, 8], defense [9, 10], industry [11, 12], behavioral analysis [13], and marketing [14]. Among the learning models, the neural network is well known for its flexibility in the choice of architecture and for its approximation ability. The single hidden layer feed-forward neural network (SLFN) architecture is a widely used model for handling prediction and pattern classification problems. However, the weights and biases of these neural networks are commonly tuned using a gradient-based optimizer. These methods are known to be slow when the learning rate is chosen improperly, and they may converge to local minima. Moreover, the learning iterations add computational cost to the tuning process of the model. In contrast to these traditional methods, randomized algorithms for training single hidden layer feed-forward neural networks, such as the extreme learning machine (ELM) [15] and the radial basis function network (RBFN) [16], have become a popular choice in recent years because of their generalization capability and faster learning speed [17, 18, 19]. Huang et al. [15] proposed ELM, which takes advantage of a random transformation of the input features to learn a generalized model in one iteration. In this method, the input weights and biases are chosen randomly for a given SLFN architecture, and the output weights are determined analytically with the generalized inverse operation.

On the other hand, RBFN uses distance-based random feature mapping (the centers of the RBFs are generated randomly). However, RBFN obtains an unsatisfactory solution in some cases, which results in poor generalization [20]. ELM, by comparison, provides an effective solution for SLFNs with good generalization and extremely fast learning, and has therefore been widely applied in areas such as regression [21], data classification [15, 21], image segmentation [22], dimension reduction [23], medical image classification [24, 25, 26], and face classification [27]. In [21], Huang et al. discussed the universal approximation capability and scalability of ELM. The classification accuracy of ELM depends on the choice of weight initialization scheme and activation function. To overcome this shortcoming, many researchers have used optimization algorithms that choose the best weights for the input layer. However, heuristic optimization reintroduces the choice of iterations and hyper-parameters. Hence, such methods suffer from the same problem as the back-propagation based neural network. Here, we therefore propose a non-iterative and non-parametric method that overcomes the limitations of ELM and iterative ELM. The main contribution of the paper is a non-iterative and non-parametric algorithm, namely backward-forward ELM, to train a single hidden layer neural network.

In addition, the proposed model is studied comprehensively on several standard machine learning classification and prediction applications, as well as on two well-known image classification data-sets, namely MNIST and Brain-MRI, for non-handcrafted feature evaluation. The rest of the paper is organized as follows. Section II gives an overview of the motivation and objective behind the development of the ELM algorithm and its limitations. Section III describes the proposed backward-forward ELM algorithm in brief. Section IV summarizes the experiments conducted, and finally, Section V concludes the study.

II Extreme Learning Machine

A feed-forward neural network trained with gradient-based weight learning is slow and requires parameter tuning. The extreme learning machine is a learning model for the single hidden layer feed-forward neural network (SLFN) in which the input-weights are chosen randomly and the output-weights are determined analytically. As a result, the network converges to the underlying regression layer in one pass, which makes learning faster than with traditional gradient-based algorithms. The development of the ELM algorithm is based on the assumption that the input weight and bias do not make much difference to the obtained accuracy, and that a small error is acceptable if many computational steps can be avoided. However, the accuracy and generalization capability depend highly on the learning of the output-weight and on the minimization of the output-weight norm.

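The approximation problem can be expressed as follows.

For $N$ distinct samples $(x_{j},t_{j})$ , $M$ hidden neurons, and activation function $g(.)$ , the output of the SLFN can be modeled as:

$$o_{j} = \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}), \quad j = 1, \dots, N$$

Hence, the error ( $E$ ) for the target output ( $t$ ) is $\sum_{j=1}^{N}||o_{j}-t_{j}||$ , which can be expressed as:

$$E = \sum_{j=1}^{N} \left\| \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) - t_{j} \right\|$$

For an ideal approximation, the error is zero. Hence,

$$\sum_{j=1}^{N} \left\| \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) - t_{j} \right\| = 0 \;\Rightarrow\; \sum_{i=1}^{M} \beta_{i} \, g(w_{i} \cdot x_{j} + b_{i}) = t_{j} \quad \text{for all } j = 1, \dots, N$$

This equation can be expressed as

$$H \beta = T$$

where,

$$H = \begin{bmatrix} g(w_{1} \cdot x_{1} + b_{1}) & \cdots & g(w_{M} \cdot x_{1} + b_{M}) \\ \vdots & \ddots & \vdots \\ g(w_{1} \cdot x_{N} + b_{1}) & \cdots & g(w_{M} \cdot x_{N} + b_{M}) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_{1} \\ \vdots \\ \beta_{M} \end{bmatrix}, \quad T = \begin{bmatrix} t_{1} \\ \vdots \\ t_{N} \end{bmatrix}$$

If $N = M$ (i.e., the sample size is the same as the number of hidden neurons), the matrix $H$ is square, and it is invertible if its determinant is nonzero. In such a case, the SLFN can approximate the samples with zero error. In reality, however, $M \ll N$ , so $H$ is not square and cannot be inverted. Hence, rather than finding an exact solution, we try to find a near-optimal solution that minimizes the approximation error, which can be expressed as:

$$\| \hat{H} \hat{\beta} - T \| \simeq \| H \beta - T \|$$

$\hat{H}$ and $\hat{\beta}$ can be defined as

$$\hat{H} = \begin{bmatrix} \hat{g}(\hat{w}_{1} \cdot x_{1} + \hat{b}_{1}) & \cdots & \hat{g}(\hat{w}_{M} \cdot x_{1} + \hat{b}_{M}) \\ \vdots & \ddots & \vdots \\ \hat{g}(\hat{w}_{1} \cdot x_{N} + \hat{b}_{1}) & \cdots & \hat{g}(\hat{w}_{M} \cdot x_{N} + \hat{b}_{M}) \end{bmatrix}, \quad \hat{\beta} = \begin{bmatrix} \hat{\beta}_{1} \\ \vdots \\ \hat{\beta}_{M} \end{bmatrix}$$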

In any learning method for an SLFN, we try to find $\hat{w}$ , $\hat{b}$ , $\hat{g}(.)$ and $\hat{\beta}$ so as to minimize the prediction error. Mostly, $\hat{g}(.)$ is chosen as a continuous function depending on the model consideration of the data (common activation functions include Sigmoid, tan-hyperbolic, and ReLU). The $\hat{w}$ , $\hat{b}$ and $\hat{\beta}$ are to be determined by the learning algorithm. Back-propagation, which uses the gradient descent method, is one of the best-known learning algorithms. However, gradient-based algorithms have the following issues associated with them:

1. Choosing a proper value for the learning rate $\eta$ : a small $\eta$ converges very slowly, whereas a very high $\eta$ makes the algorithm unstable.

2. Gradient-based learning may sometimes converge to a local minimum, which is undesirable if the difference between the global minimum and the local minimum is significantly large.

3. Overtraining sometimes leads to worse generalization; hence, proper stopping criteria are also needed.

4. Gradient-based learning is very time-consuming.
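
The first issue in the list above can be illustrated on a toy problem that is not taken from the paper: plain gradient descent on $f(w) = w^{2}$ , whose update is $w \leftarrow (1 - 2\eta)\,w$ . The function name `gradient_descent` and the chosen values of $\eta$ are assumptions made only for this sketch.

```python
def gradient_descent(eta, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with a fixed learning rate eta (toy example)."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w -= eta * grad     # equivalent to w <- (1 - 2 * eta) * w
    return w

slow = gradient_descent(0.01)      # shrinks by 0.98 per step: still far from 0
fast = gradient_descent(0.4)       # shrinks by 0.2 per step: converges quickly
unstable = gradient_descent(1.1)   # factor -1.2 per step: magnitude grows
```

With a small $\eta$ the iterate is still far from the minimum after 50 steps, with a moderate $\eta$ it is essentially at the minimum, and with $\eta > 1$ the iterate oscillates with growing magnitude.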

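For the above reasons, ELM chooses $\hat{w}$ and $\hat{b}$ randomly and uses the Moore-Penrose (MP) inverse to calculate $\hat{\beta}$ analytically. Hence, $\hat{\beta}$ can be expressed as

$$\hat{\beta} = \hat{H}^{\dagger} T = (\hat{H}^{\prime} \hat{H})^{-1} \hat{H}^{\prime} T$$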
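
A minimal NumPy sketch of this ELM training procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names `elm_fit` and `elm_predict`, the uniform range $[-1, 1]$ for the random weights, and the default `tanh` activation are all choices made here. `np.linalg.pinv` computes the Moore-Penrose inverse through the singular value decomposition, which coincides with $(\hat{H}^{\prime}\hat{H})^{-1}\hat{H}^{\prime}$ when $\hat{H}$ has full column rank.

```python
import numpy as np

def elm_fit(X, T, n_hidden, g=np.tanh, rng=None):
    """Train a single hidden layer network the ELM way (sketch).

    X: (N, P) inputs, T: (N, C) targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Input weights and biases are drawn at random and never updated.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = g(X @ W + b)                 # hidden layer output, (N, M)
    beta = np.linalg.pinv(H) @ T     # output weights, (M, C)
    return W, b, beta

def elm_predict(X, W, b, beta, g=np.tanh):
    """Forward pass of the trained network."""
    return g(X @ W + b) @ beta
```

Because the output weights are a least-squares solution, the training residual $\hat{H}\hat{\beta} - T$ is orthogonal to the columns of $\hat{H}$ up to numerical precision.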

**Drawbacks of ELM:**

Das et al. [28] studied the behavior of ELM in depth for various weight initialization schemes, activation functions, and numbers of hidden nodes. That study found that ELM has the following limitations.

• The accuracy of classification in ELM depends on the choice of weight initialization scheme and activation function.

• ELM needs a relatively large number of hidden nodes to provide high accuracy. The need for more hidden nodes suggests that the network is memorizing the samples rather than providing a generalized performance.

• Because of the random weights in the final network, ELM suffers from ill-posed problems.

To overcome these shortcomings, many researchers have used optimization algorithms [29, 6], which choose the best weight for the input layer. However, such a solution again introduces the iteration and choice of parameter problem for the optimization scheme. Hence, this paper proposes a backward-forward method for a single hidden layer neural network which has the following advantages over other learning models:

• The algorithm generalizes the network with few hidden layer nodes in only two steps. In the first step (backward pass), the input weights are evaluated, and in the second step (forward pass), the suitable output weight is determined.

• The final model of the network does not contain any random weights, thus giving a consistent result even when the choice of activation changes.

• Unlike optimization-based ELM, the proposed method evaluates the input weight in two steps. Hence, the model does not need iterative steps.
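
The two passes described above can be sketched in NumPy by chaining the relations listed in the Equations section: $H = T \times \beta^{\dagger} + \text{Random Error}$ , $W = I^{\dagger} \times H$ , $W = [W, \mathrm{orth}(W)]$ , $H = g(I \times W)$ and $\beta = H^{\dagger} \times T$ . This is a hedged reading of those equations, not the authors' code. The names `orth`, `bf_elm_fit` and `bf_elm_predict`, the scale `noise` of the random error term, the uniform range of the initial random output weights, and the interpretation of $\mathrm{orth}(W)$ as an orthonormal basis of the column space of $W$ are all assumptions. Augmentation of the hidden layer output is omitted for brevity.

```python
import numpy as np

def orth(A, tol=1e-10):
    """Orthonormal basis of the column space of A, computed via SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, s > tol * s.max()]

def bf_elm_fit(X, T, n_hidden, g=np.tanh, noise=1e-3, rng=None):
    """Backward-forward training sketch. X: (N, P), T: (N, C)."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    I = np.hstack([X, np.ones((n, 1))])        # augmented input, (N, P+1)
    half = n_hidden // 2
    # Backward pass: random output weights -> hidden output -> input weights.
    beta0 = rng.uniform(-1.0, 1.0, size=(half, T.shape[1]))
    H = T @ np.linalg.pinv(beta0) + noise * rng.standard_normal((n, half))
    W = np.linalg.pinv(I) @ H                  # (P+1, M/2)
    W = np.hstack([W, orth(W)])                # append orthogonal columns
    # Forward pass: apply the activation and solve for the output weights.
    H = g(I @ W)
    beta = np.linalg.pinv(H) @ T
    return W, beta

def bf_elm_predict(X, W, beta, g=np.tanh):
    I = np.hstack([X, np.ones((X.shape[0], 1))])
    return g(I @ W) @ beta
```

In this sketch the number of columns appended by `orth` equals the numerical rank of the learned half of the weights, so the final hidden layer can have fewer than `n_hidden` nodes if that half is rank deficient.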

III Proposed backward forward algorithm for ELM

In this section, we discuss the learning process of the proposed model. In the architecture of a single hidden layer neural network, there are two types of weights to learn, namely the input-weight (the weight matrix that represents the connections from the input to the hidden layer) and the output-weight (the weight matrix that represents the connections from the hidden to the output layer). The proposed model has two major stages, namely the backward-pass (where the input weights are learned) and the forward-pass (where the output weights are determined). We make the following assumptions to develop the proposed backward-forward algorithm for ELM (BF-ELM).

• The weights in the neural network can be categorized into two parts. Some of the weights generalize the model, and the rest of the weights are used to memorize the samples. Hence, in the backward-pass, BF-ELM determines the half of the weights that are assumed to generalize the model for a given training data-set.

• If a learned model uses linear activation and the activation is replaced, the accuracy of the model is not affected. Hence, in the backward-pass, the model assumes linear activation, and the proper activation is substituted in the forward-pass.

• If the input training set ( $I$ ) and the hidden layer output ( $H$ ) are augmented, then the bias can be ignored.
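The last assumption can be checked numerically. The sketch below appends a column of ones to the input, as in the definition of $I$ in TABLE I, and shows that the bias is absorbed as an extra weight row. The sizes and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, M = 5, 3, 4                          # illustrative sizes only
X = rng.normal(size=(N, P))                # raw inputs
W = rng.normal(size=(P, M))                # input weights
b = rng.normal(size=(1, M))                # hidden biases

I_aug = np.hstack([X, np.ones((N, 1))])    # augmented input: extra column of ones
W_aug = np.vstack([W, b])                  # bias becomes the last row of the weight matrix

explicit_bias = X @ W + b                  # usual affine map
absorbed_bias = I_aug @ W_aug              # same map with no separate bias term
```

Both expressions give the same pre-activation values, so a single pseudoinverse over the augmented matrix can estimate weights and bias together.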

Both of the stages are described in detail in the following sections.

III-A Backward-pass

In the backward-pass, the model learns a subset of the input-weight using the Moore-Penrose inverse, working in the direction from output to input. For a given training set $\mathcal{N}=\{(x_{j},t_{j}) \mid x_{j}\in\mathbb{R}^{P},\ t_{j}\in\mathbb{R}^{C},\ j=1,2,\ldots,N\}$, we design an SLFN with $M/2$ hidden nodes, which determines a subset of the input-weight ( $\widetilde{W}$ of size $(P,M/2)$ ) as follows.

1. The output-weight $\widetilde{\beta}$ of size $(M/2,C)$ is set randomly.

2. The hidden layer output matrix is determined using the following equation.

[TABLE]

3. The subset of the input-weight ( $\widetilde{W}$ ) is determined by the following equation.

[TABLE]

4. The learned subset ( $\widetilde{W}$ ) is used to determine the full input-weight ( $W$ of size $(P,M)$ ) by appending an orthogonal transformation of $\widetilde{W}$ as follows.

[TABLE]
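Since the equations of the backward-pass are not reproduced in this text, the following NumPy sketch is only one plausible reading of the four steps, not the authors' formulation. It assumes linear activation, so the hidden output solves $H\widetilde{\beta}=T$ and the input weights solve $I\widetilde{W}=H$, both by pseudoinverse. The "orthogonal transformation" is assumed here to be the orthonormal factor of a QR decomposition of $\widetilde{W}$, which may differ from the paper. The input is augmented with a column of ones, so the weight matrix in the sketch has $P+1$ rows. The name `backward_pass` and the one-hot targets are illustrative.

```python
import numpy as np

def backward_pass(I_aug, T, n_hidden, rng):
    """Hypothetical backward-pass: estimate input weights from the targets."""
    half = n_hidden // 2
    if n_hidden % 2 != 0 or half > I_aug.shape[1]:
        raise ValueError("sketch assumes an even M with M/2 <= number of input columns")
    n_outputs = T.shape[1]
    beta_tilde = rng.uniform(-1.0, 1.0, size=(half, n_outputs))  # step 1: random output weights
    H = T @ np.linalg.pinv(beta_tilde)     # step 2: solve H @ beta_tilde = T (linear activation)
    W_half = np.linalg.pinv(I_aug) @ H     # step 3: solve I_aug @ W_half = H
    Q, _ = np.linalg.qr(W_half)            # step 4 (assumed): orthonormal columns derived from W_half
    return np.hstack([W_half, Q])          # full input weights with M columns

# Toy data with iris-like sizes (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
I_aug = np.hstack([X, np.ones((50, 1))])
T = np.eye(3)[rng.integers(0, 3, size=50)]  # one-hot targets (assumed encoding)
W = backward_pass(I_aug, T, n_hidden=6, rng=rng)
```

Under this reading, the random $\widetilde{\beta}$ is discarded after the pass and only the derived input weights are kept.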

III-B Forward-pass

In the next stage, the forward-pass, the learned input-weight ( $W$ ) is used to determine the output-weight ( $\hat{\beta}$ ).

1. The hidden-layer output is determined using $W$ .

[TABLE]

where $g(\cdot)$ is the activation function.

2. Finally, the output-weight is determined as follows:

[TABLE]
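The forward-pass can be sketched in the same way. This is a minimal illustration under stated assumptions, not the authors' code: the sigmoid stands in for $g(\cdot)$, the hidden output is augmented with a column of ones following the augmentation assumption of Section III, and the random `W` in the demo is only a stand-in for the matrix produced by the backward-pass. The names `forward_pass` and `predict` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(I_aug, W, g):
    H = g(I_aug @ W)                                     # hidden-layer output with the chosen activation
    return np.hstack([H, np.ones((H.shape[0], 1))])      # augment H so the output bias is absorbed

def forward_pass(I_aug, T, W, g=sigmoid):
    """Output weights by least squares, given fixed input weights W."""
    return np.linalg.pinv(hidden_output(I_aug, W, g)) @ T

def predict(I_aug, W, beta_hat, g=sigmoid):
    return hidden_output(I_aug, W, g) @ beta_hat

# Toy data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
I_aug = np.hstack([X, np.ones((60, 1))])
T = np.eye(3)[rng.integers(0, 3, size=60)]   # one-hot targets (assumed encoding)
W = rng.uniform(-1.0, 1.0, size=(5, 6))      # stand-in for the backward-pass result
beta_hat = forward_pass(I_aug, T, W)
pred = predict(I_aug, W, beta_hat)
```

For classification, the predicted class of a sample would typically be taken as the index of the largest entry in its row of `pred`.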

The overall diagram of the proposed BF-ELM model is given in Fig. 1, which shows the determination of the input weight ( $W$ ) and the output weight ( $\beta$ ). In the next section, various experiments are carried out on multiple benchmark data-sets to show the learning capability of BF-ELM. The proposed algorithm needs fewer nodes than ELM to achieve better generalization performance.

IV Performance evaluation

In this section, the performance of the proposed BF-ELM is compared with that of ELM on various benchmark data-sets. The comparison is made with respect to the number of hidden neurons required for generalized performance and the time needed to compute the output for the testing set. All implementations of BF-ELM and ELM are carried out in MATLAB 2018b running on an i7-4710HQ processor with the Ubuntu operating system. The pseudoinverse (†) operation is done using the MATLAB built-in function, and the ELM implementation follows the paper [15]. The experiments are divided into two parts: the first compares the two algorithms on the basis of the hidden nodes required, and the second observes the behavior of the models with respect to changes in the weight initialization scheme and activation function, as described in TABLE II and TABLE III respectively. The test is conducted for each combination of weight initialization scheme and activation function.
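TABLE II and TABLE III are not reproduced in this text, so the sketch below only illustrates the schemes and activations that are named in the results (random(-1,1), Xavier, and orthogonal initialization; sigmoid, sin, soft-sign, and arc-tan activation). The formulas are the commonly used definitions and are an assumption here; the paper's tables may define them differently. All function names are hypothetical.

```python
import numpy as np

def init_uniform(rows, cols, rng):
    return rng.uniform(-1.0, 1.0, size=(rows, cols))        # random(-1,1)

def init_xavier(rows, cols, rng):
    limit = np.sqrt(6.0 / (rows + cols))                     # Glorot uniform bound (assumed variant)
    return rng.uniform(-limit, limit, size=(rows, cols))

def init_orthogonal(rows, cols, rng):
    A = rng.normal(size=(max(rows, cols), min(rows, cols)))
    Q, _ = np.linalg.qr(A)                                   # Q has orthonormal columns
    return Q if rows >= cols else Q.T                        # orthonormal columns or rows

ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "sin": np.sin,
    "soft-sign": lambda z: z / (1.0 + np.abs(z)),
    "arc-tan": np.arctan,
}

rng = np.random.default_rng(0)
W_tall = init_orthogonal(5, 3, rng)
W_wide = init_orthogonal(3, 5, rng)
W_xav = init_xavier(5, 3, rng)
```

Running every pairing of an initializer with an entry of `ACTIVATIONS` gives the kind of grid that the second experiment reports.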

A brief description of the benchmark data-sets and the analysis of the results are given as follows:

IV-A Benchmark with sine cardinal regression problems

The sine cardinal approximation function (as given in equation 14) is used to test the proposed learning model. First, 5000 data points are generated for the training set, where $x$ is randomly distributed over [-10,10] and an additive random error, uniformly distributed over [-0.2,0.2], is applied to the response $y$ . The testing set is created without any additive error.

[TABLE]
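The data generation described above can be sketched as follows. Equation 14 is not reproduced in this text, so the sketch assumes the standard unnormalized sine cardinal, $\sin(x)/x$ with the value 1 at $x=0$. The test-set size and the random seed are also assumptions, since the text only specifies the 5000 training points.

```python
import numpy as np

def sinc(x):
    """Unnormalized sine cardinal: sin(x)/x, defined as 1 at x = 0."""
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)                  # placeholder value avoids division by zero
    return np.where(x == 0.0, 1.0, np.sin(safe) / safe)

rng = np.random.default_rng(0)
n_train = 5000
x_train = rng.uniform(-10.0, 10.0, size=(n_train, 1))
noise = rng.uniform(-0.2, 0.2, size=(n_train, 1))      # additive uniform error on the response
y_train = sinc(x_train) + noise

n_test = 5000                                           # assumed; not stated in the text
x_test = rng.uniform(-10.0, 10.0, size=(n_test, 1))
y_test = sinc(x_test)                                   # testing set has no additive error
```

Note that `numpy.sinc` implements the normalized form $\sin(\pi x)/(\pi x)$, which is why a separate helper is used here.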

An experiment is conducted to analyze the number of hidden nodes required to solve the regression problem for both ELM and BF-ELM. During the experiment, the activation is set to the sin function. The obtained root-mean-squared-error (RMSE) is depicted in Fig. 2.
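For reference, the RMSE used as the error measure is the square root of the mean squared difference between the expected and predicted responses. A minimal helper, with a hypothetical name and toy values, might look like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between expected and predicted responses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Every prediction is off by exactly 1, so the RMSE is 1.
example = rmse([0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0])
```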

The RMSE decreases as the number of hidden nodes increases. It is observed that BF-ELM attains a lower error than ELM with up to 12 hidden nodes; beyond that, the ELM results are superior, and an equilibrium point is reached with 17 or more hidden nodes. The effect of the various activation functions and weight initialization schemes is summarized in TABLE IV with respect to RMSE and testing time. The number of hidden nodes for both ELM and BF-ELM is set to 10.

TABLE IV shows the minimum RMSE value for every combination of weight initialization and activation function. As the SLFN architecture is the same for both ELM and BF-ELM, the testing times of the two algorithms are nearly the same. Fig. 3 shows the approximated function learned by ELM and BF-ELM for the input training data. The best result was obtained with four hidden nodes and sinc activation. The values approximated by BF-ELM are close to the expected values.

IV-B Benchmark with iris data-set

The iris data-set uses multiple measurements, namely sepal length, sepal width, petal length, and petal width, to classify 3 species of iris: Setosa, Versicolor, and Virginica. The data-set contains 50 samples per class and is divided into training and testing sets in a 70:30 ratio. The accuracy increases with the number of hidden nodes, and the performance comparison of BF-ELM and ELM with respect to the number of hidden nodes is given in Fig. 4.

The best accuracy for BF-ELM is obtained with six hidden nodes, orthogonal weight initialization, and the sigmoid activation function. Hence, further analysis of the choice of weight initialization method and activation function is done with six hidden nodes. The summary of the analysis is given in TABLE V.

It is observed from TABLE V that BF-ELM provides optimal accuracy for all combinations of weight initialization scheme and activation function, which shows the superior performance of BF-ELM over ELM. Further studies are made with medium-size and large, complex applications.

IV-C Benchmark with Satimage and Shuttle data-set

Satimage is a medium-size data-set with 4435 training and 2000 testing samples. Each sample contains 4 spectral bands of a $3\times 3$ neighborhood, i.e., 36 predictive attributes, and is assigned to one of seven classes, namely red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, mixture class, and very damp gray soil. The testing accuracy with respect to the number of hidden nodes is analyzed and summarized in Fig. 5. Similarly, the shuttle data-set consists of 43,500 training samples and 14,500 testing samples with nine attributes. The data-set has 7 classes, namely Rad flow, Fpv close, Fpv open, High, Bypass, Bpv close, and Bpv open. Fig. 6 compares BF-ELM and ELM on the testing set of the shuttle data-set with respect to the number of nodes.

Fig. 5 shows that BF-ELM converges to the optimum accuracy with fewer nodes than ELM. The testing accuracy obtained by BF-ELM is superior for every node count up to 100 nodes. Further testing is done with respect to variation of the activation function and weight initialization scheme. The results obtained for 20 hidden nodes are given in TABLE VI.

Fig. 6 shows the superiority of BF-ELM over ELM in testing accuracy. The summary of the analysis over the choice of weight initialization scheme and activation function is given in TABLE VII.

It is observed from TABLE VI that orthogonal initialization with the sigmoid activation function and random initialization with the arc-tan function give the best results for the Satimage data-set. Similarly, TABLE VII shows that the best results are obtained with random(-1,1) weight initialization and the sigmoid activation function, orthogonal weights with the soft-sign function, and orthogonal weights with the arc-tan function. The following sections describe studies carried out on large and complex data-sets.

IV-D Benchmark with large forest cover data-set

The proposed model is also tested on the very large data-set of the forest-cover type prediction application. This data-set presents an extremely large prediction problem with seven classes. It contains 581,012 samples with 54 attributes, randomly permuted over seven classes, namely spruce-fir, lodgepole pine, ponderosa pine, willow, aspen, Douglas-fir, and krummholz. The data-set is divided into training and testing samples in accordance with the suggestion given in the data-set description, i.e., the first 15,120 samples are used for training and the remaining 565,892 samples are used for testing. The first experiment studies the effect of the number of hidden nodes on both the ELM and BF-ELM algorithms. The results obtained are shown in Fig. 7.

Fig. 7 shows that the accuracy on the testing set increases for both the ELM and BF-ELM algorithms. The figure depicts the performance of both algorithms up to 2000 nodes, and for every node count tested, the accuracy obtained by BF-ELM is higher than that of ELM. In the second experiment, the effect of the weight initialization scheme and activation function is studied. For this experiment, the number of nodes was set to 200, and the orthogonal weight initialization scheme with the sigmoid activation function is used for both algorithms. The obtained results are given in TABLE VIII. The table shows that for every combination of weight initialization scheme and activation function, the accuracy obtained by BF-ELM is superior to that of ELM.

Further studies are carried out on image data-sets, where the pixels are used directly as feature input to the SLFN. This represents learning non-handcrafted features directly from raw training images. The next two sections present performance studies on the MNIST and brain-MRI data-sets respectively.

IV-E Benchmark with MNIST digit data-set

The Modified National Institute of Standards and Technology (MNIST) hand-written digit data-set has been a standard for training and testing in the field of machine learning since 1999. The data-set consists of 60000 training and 10000 testing samples. The images have already been normalized to size $28\times 28$ and are presented in vector format. Fig. 8 shows some of the samples in the MNIST data-set.

The first experiment studies the performance of ELM and BF-ELM with respect to the number of nodes. Fig. 9 presents the accuracy comparison of ELM and BF-ELM with orthogonal weight initialization and the sigmoid activation function.

From Fig. 9, it is observed that BF-ELM achieves superior results with 20 hidden nodes and that the testing accuracy keeps increasing as hidden nodes are added. In the second experiment, the weight initialization and activation function are studied. The performance obtained for BF-ELM and ELM is presented in TABLE IX. The experiment shows that BF-ELM achieves better accuracy in every combination. The best performance is achieved with Xavier weight initialization and the sigmoid activation function. The next experiment is carried out on a pathological brain-MRI data-set, which has gray-level intensities in each image.

IV-F Benchmark with multiclass brain MRI data-set

A multiclass brain MR data-set comprising 200 images (40 normal and 160 pathological brain images) is used to evaluate the proposed model. The pathological brains contain diseases of four categories, namely brain stroke, degenerative, infectious, and brain tumor; each category holds 40 images. The images are re-scaled to $80\times 80$ before being applied to the network directly. Fig. 10 shows some of the samples in the brain-MRI data-set. The training and testing sets are obtained by an 80:20 stratified division.
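The stratified division can be sketched as below. With five classes of 40 images each, an 80:20 split keeps 32 images per class for training and 8 per class for testing, i.e., 160 and 40 images overall. The function name, the seed, and the integer label encoding are assumptions made for illustration; the paper does not state how the split was implemented.

```python
import numpy as np

def stratified_split(labels, train_fraction, rng):
    """Split sample indices so each class keeps the same train/test ratio."""
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))   # shuffle within the class
        n_train = int(round(train_fraction * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat(np.arange(5), 40)   # 1 normal + 4 pathological classes, 40 images each
rng = np.random.default_rng(0)
train_idx, test_idx = stratified_split(labels, 0.8, rng)
```

Each $80\times 80$ image, once flattened, contributes a 6400-dimensional input vector to the SLFN.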

Fig. 11 shows the results obtained during the first experiment, in which the testing accuracy obtained by BF-ELM is compared to that of ELM with an increasing number of hidden nodes, up to 20. The experiment is carried out with the orthogonal weight initialization scheme and the sigmoid activation function. Here, it is observed that BF-ELM achieves its best accuracy with 9 hidden nodes.

The results of the second experiment are summarized in TABLE X, which depicts the effect of the various weight initialization schemes and activation functions on learning in an SLFN using ELM and BF-ELM with 10 hidden nodes.

The above experiments highlight the performance improvement of an SLFN learned by BF-ELM over an SLFN learned by ELM. Because BF-ELM uses two passes while ELM learns in a single pass, the proposed model takes twice the training time of ELM. However, the advantage of BF-ELM is that the final network does not contain any random weights. Moreover, in many of the applications discussed above, BF-ELM achieves better performance with fewer hidden nodes.

V Conclusion

This paper proposes a backward-forward algorithm for a single hidden layer neural network, which is a modified version of the extreme learning machine. The proposed model performs better than ELM with fewer hidden nodes. Further, the evaluation of the model with respect to various weight initialization schemes and activation functions shows the stability of the model, as the variance in the accuracy obtained on the testing set is small compared to ELM. The proposed model can be used directly as a classifier or as a weight initialization model for fine-tuning with a gradient-based method. In future, the model can be extended to multi-layer neural networks and convolutional neural networks.

Bibliography

[1] S. Mori, C. Y. Suen, and K. Yamamoto, “Historical review of OCR research and development,” in Document Image Analysis. IEEE Computer Society Press, 1995, pp. 244–273.
[2] Y. Alginahi, Preprocessing techniques in character recognition. INTECH Open Access Publisher, 2010.
[3] R. K. Mohapatra, B. Majhi, and S. K. Jena, “Classification of handwritten Odia basic character using stockwell transform,” International Journal of Applied Pattern Recognition, vol. 2, no. 3, pp. 235–254, 2015.
[4] K.-S. Fu, Applications of pattern recognition. CRC press, 2019.
[5] S. Lokesh, P. Malarvizhi Kumar, M. Ramya Devi, P. Parthasarathy, and C. Gokulnath, “An automatic tamil speech recognition system by using bidirectional recurrent neural network with self-organizing map,” Neural Computing and Applications, vol. 31, no. 5, pp. 1521–1531, May 2019.
[6] D. R. Nayak, R. Dash, and B. Majhi, “Discrete ripplet-ii transform and modified pso based improved evolutionary extreme learning machine for pathological brain detection,” Neurocomputing, vol. 282, pp. 232–247, 2018.
[7] S. Beura, B. Majhi, and R. Dash, “Mammogram classification using two dimensional discrete wavelet transform and gray-level co-occurrence matrix for detection of breast cancer,” Neurocomputing, vol. 154, pp. 1–14, 2015.
[8] S. Mishra, B. Majhi, and P. K. Sa, “Texture feature based classification on microscopic blood smear for acute lymphoblastic leukemia detection,” Biomedical Signal Processing and Control, vol. 47, pp. 303–311, 2019.