Distributed dual vigilance fuzzy adaptive resonance theory learns online, retrieves arbitrarily-shaped clusters, and mitigates order dependence

Leonardo Enzo Brito da Silva; Islam Elnabarawy; Donald C. Wunsch II

arXiv:1901.00794·cs.NE·January 4, 2019


TL;DR

This paper introduces DDVFA, an ART-based modular clustering system that learns online, effectively retrieves arbitrarily-shaped clusters, and reduces order dependence through preprocessing or postprocessing methods, outperforming or matching other clustering algorithms.

Contribution

The paper proposes DDVFA, a novel distributed dual vigilance fuzzy ART architecture with mechanisms to mitigate order dependence and improve online clustering of complex data shapes.

Findings

1. DDVFA outperforms other ART-based systems in online mode with random data presentation.

2. DDVFA achieves performance comparable to non-ART algorithms like DBSCAN, HAC, and k-means.

3. The system effectively retrieves arbitrarily-shaped clusters and controls cluster compactness.

Equations

$$T_{j}=\frac{|\bm{x}\wedge\bm{w}_{j}|}{\alpha+|\bm{w}_{j}|},$$

$$M_{j}=\frac{|\bm{x}\wedge\bm{w}_{j}|}{|\bm{x}|},$$

$$\nu:M_{j}\geq\rho,$$

$$\bm{w}_{j}^{new}=(1-\beta)\bm{w}_{j}^{old}+\beta(\bm{x}\wedge\bm{w}_{j}^{old}),$$

$$\rho_{b}=\frac{1}{2}(\rho_{a}+1),$$

$$T_{j}=1-\frac{|(\bm{x}\wedge\bm{w}_{j})-\bm{w}_{j}|}{|\bm{x}|}.$$

$$T^{ART^{(1)}_{i}}=f\left(T^{ART^{(1)}_{i}}_{1},T^{ART^{(1)}_{i}}_{2},...,T^{ART^{(1)}_{i}}_{k}\right),$$

$$T^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{x}\wedge\bm{w}^{ART^{(1)}_{i}}_{j}|}{\alpha+|\bm{w}^{ART^{(1)}_{i}}_{j}|}\right)^{\gamma},\quad j\in\{1,...,k\},$$

$$M^{ART^{(1)}_{i}}=g\left(M^{ART^{(1)}_{i}}_{1},M^{ART^{(1)}_{i}}_{2},...,M^{ART^{(1)}_{i}}_{k}\right),$$

$$M^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{x}\wedge\bm{w}^{ART^{(1)}_{i}}_{j}|}{|\bm{x}|}\right)^{\gamma},\quad j\in\{1,...,k\}.$$

$$M^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{w}^{ART^{(1)}_{i}}_{j}|}{|\bm{x}|}\right)^{\gamma^{*}}T^{ART^{(1)}_{i}}_{j},\quad j\in\{1,...,k\}$$

$$t_{i,j}=\left(\frac{|\bm{w}^{ART^{(1)}_{l}}_{j}\wedge\bm{w}^{ART^{(2)}_{k}}_{i}|}{\alpha+|\bm{w}^{ART^{(2)}_{k}}_{i}|}\right)^{\gamma},$$

$$m_{i,j}=\left(\frac{|\bm{w}^{ART^{(2)}_{k}}_{i}|}{|\bm{w}^{ART^{(1)}_{l}}_{j}|}\right)^{\gamma^{*}}t_{i,j}.$$

$$AR=\frac{\binom{N}{2}(tp+tn)-[(tp+fp)(tp+fn)+(fn+tn)(fp+tn)]}{\binom{N}{2}^{2}-[(tp+fp)(tp+fn)+(fn+tn)(fp+tn)]},$$

$$M_{\gamma}^{n}=\left(\max(M_{\gamma^{*}})-\min(M_{\gamma^{*}})\right)\left(\frac{M_{\gamma}-\min(M_{\gamma})}{\max(M_{\gamma})-\min(M_{\gamma})}\right)+\min(M_{\gamma^{*}}).$$

$$M_{\gamma}^{n}=\max(M_{\gamma^{*}})\left(\frac{M_{\gamma}}{\max(M_{\gamma})}\right)=\left(\frac{|\bm{w}\wedge\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}}\left(\frac{\frac{|\bm{x}\wedge\bm{w}|}{|\bm{x}|}}{\frac{|\bm{w}\wedge\bm{w}|}{|\bm{x}|}}\right)^{\gamma}=\left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}}\left(\frac{|\bm{x}\wedge\bm{w}|}{c+|\bm{w}|}\right)^{\gamma},$$

$$M_{\gamma}^{n}=\left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}}T_{\gamma},$$

Code & Models

Repositories

ACIL-Group/DDVFA (official)

Full text

Leonardo Enzo Brito da Silva

[email protected]

Islam Elnabarawy

Donald C. Wunsch II

Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering,

Missouri University of Science and Technology, Rolla, MO 65409 USA.

Applied Computational Intelligence Laboratory, Department of Computer Science,

Missouri University of Science and Technology, Rolla, MO 65409 USA.

CAPES Foundation, Ministry of Education of Brazil, Brasília, DF 70040-020, Brazil.

Abstract

This paper presents a novel adaptive resonance theory (ART)-based modular architecture for unsupervised learning, namely the distributed dual vigilance fuzzy ART (DDVFA). DDVFA consists of a global ART system whose nodes are local fuzzy ART modules. It is equipped with the distinctive features of distributed higher-order activation and match functions, using dual vigilance parameters responsible for cluster similarity and data quantization. Together, these allow DDVFA to perform unsupervised modularization, create multi-prototype clustering representations, retrieve arbitrarily-shaped clusters, and control their compactness. Another important contribution is the reduction of order-dependence, an issue that affects any agglomerative clustering method. This paper demonstrates two approaches for mitigating order-dependence: preprocessing using visual assessment of cluster tendency (VAT) or postprocessing using a novel Merge ART module. The former is suitable for batch processing, whereas the latter can be used in online learning. Experimental results in the online learning mode carried out on 30 benchmark data sets show that DDVFA cascaded with Merge ART statistically outperformed the best other ART-based systems when samples were randomly presented. Conversely, they were found to be statistically equivalent in the offline mode when samples were pre-processed using VAT. Remarkably, performance comparisons to non-ART-based clustering algorithms show that DDVFA (which learns incrementally) was also statistically equivalent to the non-incremental (offline) methods of DBSCAN, single linkage hierarchical agglomerative clustering (HAC), and the offline version of k-means, while retaining the appealing properties of ART. Links to the source code and data are provided. Considering the algorithm’s simplicity, online learning capability, and performance, it is an ideal choice for many agglomerative clustering applications.

keywords:

Fuzzy, Adaptive Resonance Theory, Clustering, Distributed Representation, Topology, Visual Assessment of Cluster Tendency.

journal: arXiv.org

1 Introduction

There is a rich literature of clustering methods [1, 2, 3], and among the neural network-based ones, adaptive resonance theory (ART) [4] is of great interest due to its many useful properties [5], particularly the fact that it addresses the stability-plasticity dilemma. After sufficient exposure to the environment, a competitive learning neural network eventually learns prototypical representations or archetypes that reflect groups of samples [6]; i.e., it learns a succinct or compressed representation of the data.

Numerous ART-based architectures have been conceived, such as fusion ART [7], whose variants have been effectively used for semi-supervised [8], supervised [9], and reinforcement learning applications [10, 11, 12]; BARTMAP [13, 14] for biclustering applications, such as unsupervised gene expression analysis; and architectures with distinct internal category representations such as hyperboxes [15], Gaussians [16, 17], hyperspheres [18], hyperellipsoids [19], and others.

Particularly, ART has been used as the basis for several hierarchical clustering methods, which can be classified into bottom-up (agglomerative or merging methods) and top-down (divisive or splitting methods) [2]. Hierarchical ART architectures generally follow two main designs [20]: (a) a series/cascade of ART modules where the output of one ART (i.e., a prototype) is the input of the next [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] or (b) parallel ART modules sharing the same inputs and using different vigilance values [32, 33, 6, 34, 35, 36, 37, 38, 39]. Generally, the hierarchical relationships between ART modules are defined implicitly by the input signal flow, explicitly by enforcing constraints or connections, and/or by the setting of multiple vigilance parameters to define hierarchies. Alternatively, hierarchies within the same ART can be created by designing custom ART activation functions [40, 41] or by analyzing its distributed activation patterns [42]. ART-based hierarchical approaches have been successfully applied, for instance, in text mining [43, 20] and robotics [30, 39].

Another branch of clustering includes multi-prototype-based methods. These allow multiple prototypes to represent a single cluster and more accurately capture the data topology, thereby typically handling clusters with arbitrary shapes. Multi-prototype representations have been successfully used for clustering [44, 45, 46, 47, 48], visualization [49, 46, 50], and validation purposes [51, 52]. In the context of ART, examples include the combination of an ART-like system using quadratic neurons [53] and hierarchical clustering [54, 55] and the related approach [56] using fuzzy ART [15]. Other methods have augmented ART-based systems by employing dual vigilance parameters [57], connecting the first and second resonating categories [58, 59, 60, 61, 35, 36, 37, 38], or replacing fuzzy ART’s nodes with growing cell structures [62] in a hybrid architecture [63].

Although they are based on multi-prototype representation, many of the previously mentioned approaches do not adopt distributed activation, match, or learning, which improve a network’s noise robustness and compactness [64, 65]. The distributed ART model [64] is endowed with all of these distributed features; however, it does not possess a mechanism to build, in an unsupervised manner, a permanent and binary many-to-one mapping (i.e., a multi-prototype cluster representation). Thus, it is still limited by its nested hyperbox cluster abstractions. Distributed learning is also featured in the ART variants introduced in [66, 67]. In the ART literature, the power of distributed activation has been harnessed to perform, for instance, (a) unsupervised feature extraction [68]; (b) hierarchical clustering [21, 29] – although featuring distributed representation, the latter approaches are cascade architectures not designed to model arbitrarily-shaped clusters since they are limited by their category representations at each hierarchical level; and (c) supervised learning systems such as the distributed ARTMAP [65], which is a generalization of a variety of ART models [69] such as [15, 70, 69, 71, 72] and uses distributed ART as its building block, some topoART variants [73, 74], default ARTMAPs [69, 71], and adaptive resonance associative map [9] variants [75, 31].

The distributed dual vigilance fuzzy ART (DDVFA) introduced here belongs to the class of modular neural networks [76, 77, 78]. Specifically, it is designed for the unsupervised learning task of clustering. This class of network architectures employs a divide-and-conquer approach and shares the following main features [76, 77, 78]: task decomposition (breaking down a complex problem) and multi-module decision making (combining local decisions in a single global consensus). Commonly, unsupervised learning methods are used as a pre-processing stage to partition the data to be handled by simple, fast, and efficient supervised modules. ART-based systems have been used for such purposes in supervised modular networks [76, 77, 78]. A current challenge for incremental learners, such as ART-based systems, is the order of sample presentation. Thus, suitable pre- and post-processing strategies are usually employed when applicable (see references in [79]). Specifically, post-processing merging strategies are commonly used in conjunction with incremental learners (e.g., [80, 81, 59, 60, 61, 82, 83, 31]); here, a novel ART-based network provides such functionality. Additionally, visualization and assessment are valuable assets when performing cluster analysis [2, 84, 50]; here, the visual assessment of cluster tendency (VAT) [85, 84] technique is used for its sample ordering properties to emulate scenarios in which such data pre-processing is practical, as per [79].

This paper presents the following main contributions:

1. A novel modular fuzzy ART-based architecture (DDVFA). Unsupervised dynamic modularization (creation of new local modules as needed) and multi-prototype representation are accomplished by employing dual vigilance parameters associated with global and local fuzzy ART modules.

2. Novel higher order distributed activation and normalized match functions based on hierarchical agglomerative clustering (HAC) methods embedded in the incremental learning process. Suitably setting the HAC-based activation/match functions allows DDVFA to retrieve arbitrarily shaped clusters, and higher order match functions have the potential to generate more compact DDVFA networks (as per [64, 65]) and extend the regions of successful dual vigilance parameter combinations.

3. A novel Merge ART module compatible with DDVFA for post-processing in online learning applications. This procedure compensates for the errors caused by the random order of input presentations, thus enabling improved performance.

4. An analysis of the behavior of the DDVFA with and without pre-processing (VAT) and post-processing (Merge ART) strategies, as well as with respect to its kernel width parameter.

The results show that together, these features enable DDVFA to yield an improved performance compared to other current state-of-the-art fuzzy ART-based technologies.

The remainder of this paper is organized as follows: Section 2 provides a brief review of ART, fuzzy ART, fuzzy topoART, and dual vigilance fuzzy ART; Section 3 introduces distributed dual vigilance fuzzy ART; Section 4 describes the experimental set-up; Section 5 reports and discusses the results; and Section 6 concludes the paper.

2 Adaptive Resonance Theory

Adaptive resonance theory (ART) [86] is the theory that learning is often mediated by resonant feedback in neural circuits. It inspired the development of many neural network architectures, each with its own internal categorical representation, while sharing the same design principles (Fig. 1). The ART matching rule [4] is a key property of these ART systems [69, 71]; it regulates the interaction between top-down expectations (represented by the internal categories or templates) and the bottom-up inputs. This process is guided by an orienting subsystem, which performs a hypothesis test, called the vigilance check, that either shuts down or enables an ART category to learn. ART templates have specific properties and governing equations based on their internal representation. They allow for a discretization of the data space, thus summarizing it as clusters. The vigilance parameter (see Eq. (3)) controls category size and thus the granularity of this discretization.

2.1 Fuzzy ART

Fuzzy ART [15] is an ART architecture designed to work with real-valued data. Concisely, when a sample $\bm{x}\in\mathbb{R}^{d}$ is presented at the feature representation field $F_{1}$, it activates the category $j$ at the category representation field $F_{2}$ whose weight vector $\bm{w}_{j}$ maximizes the following activation function:

$$T_{j}=\frac{|\bm{x}\wedge\bm{w}_{j}|}{\alpha+|\bm{w}_{j}|},\tag{1}$$

where $|\cdot|$ is the $L_{1}$ norm and $\alpha>0$ is the choice parameter, which is usually set to a small value. A comprehensive study on its behavior can be found in [87].

Next, a match function evaluates the best matching category as:

$$M_{j}=\frac{|\bm{x}\wedge\bm{w}_{j}|}{|\bm{x}|},\tag{2}$$

and a vigilance check $\nu$ is performed using the computed match value:

$$\nu:M_{j}\geq\rho,\tag{3}$$

where $0\leq\rho\leq 1$ is the vigilance parameter. If $\nu$ is satisfied, then the winning category’s weight vector is updated as:

$$\bm{w}_{j}^{new}=(1-\beta)\bm{w}_{j}^{old}+\beta(\bm{x}\wedge\bm{w}_{j}^{old}),\tag{4}$$

where $0<\beta\leq 1$ is the learning rate parameter. Otherwise, this category is deactivated, and the search continues by activating the next highest ranked category. If none of them satisfies this constraint, then a new category is created to encode sample $\bm{x}$. Thus, the problem of selecting the number of clusters is traded for that of selecting the vigilance value $\rho$.
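The search cycle described above (rank categories by activation, test vigilance, update or create) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the class and function names are hypothetical, complement coding of the inputs is assumed as the usual fuzzy ART preprocessing, and the parameter values are arbitrary.

```python
import numpy as np

def complement_code(x):
    """Map x in [0, 1]^d to [x, 1 - x] (assumed preprocessing, not stated above)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, 1.0 - x])

class FuzzyART:
    """Hypothetical minimal fuzzy ART; weights are non-negative, so sum() is the L1 norm."""

    def __init__(self, rho=0.75, alpha=1e-3, beta=1.0):
        self.rho, self.alpha, self.beta = rho, alpha, beta
        self.w = []  # one weight vector per category

    def activation(self, x, w):  # Eq. (1)
        return np.minimum(x, w).sum() / (self.alpha + w.sum())

    def match(self, x, w):  # Eq. (2)
        return np.minimum(x, w).sum() / x.sum()

    def learn(self, x):
        """Present one coded sample and return the index of the resonating category."""
        order = sorted(range(len(self.w)),
                       key=lambda j: self.activation(x, self.w[j]),
                       reverse=True)
        for j in order:
            if self.match(x, self.w[j]) >= self.rho:  # vigilance check, Eq. (3)
                self.w[j] = ((1 - self.beta) * self.w[j]
                             + self.beta * np.minimum(x, self.w[j]))  # Eq. (4)
                return j
        self.w.append(x.copy())  # no category resonated: create a new one
        return len(self.w) - 1

# Two nearby samples share a category; a distant one opens a new category.
net = FuzzyART(rho=0.75)
samples = [[0.10, 0.10], [0.12, 0.11], [0.90, 0.90]]
labels = [net.learn(complement_code(s)) for s in samples]
```

Raising `rho` in this sketch produces more, smaller categories, which is the trade-off between the number of clusters and the vigilance value noted above.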

Fuzzy ART features many appealing properties such as scalability, speed, stability, plasticity, online (one pass) and offline incremental learning modes, as well as simple implementation, transparency, and novelty detection (rare/unusual events) [69, 71, 5, 2, 88].

2.2 Fuzzy topoART

Fuzzy topoART [35] incorporates topology-based learning [89] into ART. Briefly, it consists of multiple independent fuzzy ART modules where the preceding modules filter the shared inputs to subsequent ones. Standard topoART consists of two identical modules: A and B. During training, which is processed in parallel for all modules, an “instance counting” feature accounts for the number of samples $n$ learned by a given category. Every $\tau$ learning cycles/iterations (number of sample presentations), a noise thresholding procedure is performed to remove categories with fewer than $\phi$ samples. Categories that meet this threshold change from “candidate” to “permanent” categories. A sample is propagated to module B if it has resonated with a “permanent” category of module A.

The granularity of the solutions is defined by the modules’ different vigilance parameter values. Module B’s vigilance parameter is [35, 36, 37]:

$$\rho_{b}=\frac{1}{2}(\rho_{a}+1),\tag{5}$$

where $\rho_{a}$ is module A’s vigilance parameter. Since $\rho_{b}\geq\rho_{a}$, modules A and B yield increasingly finer partitions of a given data set. Categories are laterally connected by edges between the first and second resonating categories (i.e., the two highest ranked categories that simultaneously satisfy the vigilance test (Eq. (3))) to mirror the input distribution. This multi-prototype method enables topoART modules to learn topologies and capture clusters with arbitrary geometries. Besides competitive learning, it also uses cooperative learning by allowing the second winner ($sbm$) to learn with a smaller learning rate than the first ($bm$): $\beta_{sbm}<\beta_{bm}=1$. Finally, to compensate for fuzzy ART’s bias toward small categories, topoART uses a particular activation function for prediction, which is independent of category size [35, 36, 37]:

$$T_{j}=1-\frac{|(\bm{x}\wedge\bm{w}_{j})-\bm{w}_{j}|}{|\bm{x}|}.\tag{6}$$
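As a small numerical illustration of the two topoART formulas above, the sketch below computes module B's vigilance and the size-independent prediction activation. The function names and the example weight vectors are hypothetical, and complement-coded inputs are assumed.

```python
import numpy as np

def rho_b(rho_a):
    """Module B's vigilance from module A's (the midpoint between rho_a and 1)."""
    return 0.5 * (rho_a + 1.0)

def prediction_activation(x, w):
    """topoART prediction activation: 1 - |(x ^ w) - w| / |x|."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return 1.0 - np.abs(np.minimum(x, w) - w).sum() / x.sum()

# Complement-coded 2-D inputs [x1, x2, 1 - x1, 1 - x2] (assumed coding).
inside = np.array([0.5, 0.5, 0.5, 0.5])    # the point (0.5, 0.5)
outside = np.array([0.7, 0.5, 0.3, 0.5])   # the point (0.7, 0.5)
small_box = np.array([0.4, 0.4, 0.4, 0.4])  # hyperbox [0.4, 0.6]^2
large_box = np.array([0.1, 0.1, 0.1, 0.1])  # hyperbox [0.1, 0.9]^2

t_small = prediction_activation(inside, small_box)
t_large = prediction_activation(inside, large_box)
t_out = prediction_activation(outside, small_box)
```

A sample enclosed by a category obtains the maximum activation whatever the category's size, whereas a sample outside the box is penalized in proportion to how far the box would have to grow to enclose it.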

TopoART has spawned several variants for unsupervised [36, 37, 38], supervised [74, 73], and semi-supervised [90] learning paradigms.

2.3 Dual vigilance fuzzy ART

Dual vigilance fuzzy ART (DVFA) [57] consists of a single ART module equipped with two layered vigilance parameters. The larger vigilance value is referred to as the “upper bound” ($\rho_{UB}$) and is responsible for the data compression/quantization, whereas the lower vigilance value is referred to as the “lower bound” ($\rho_{LB}$) and is responsible for the cluster similarity. Briefly, when a category is activated after a winner-takes-all competition, a vigilance check with a large value is performed (using $\rho_{UB}$ in Eq. (3)); if it is satisfied, then the system behaves identically to fuzzy ART. However, if this test fails, then a second test is performed with a slightly smaller vigilance value (using $\rho_{LB}$ in Eq. (3)). If the same category satisfies this looser constraint, then a new category is created and assigned to the same cluster as the tested category in an output mapping matrix like fuzzy ARTMAP’s [70]. Therefore, a many-to-one mapping of categories to clusters is created (this is a multi-prototype approach). In this manner, the data distribution can be more faithfully mirrored, and clusters of arbitrary geometries may be retrieved.
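The two-stage vigilance test can be sketched as below. This is a simplified reading of the description above, not the reference DVFA code: the class name is hypothetical, the category-to-cluster map is stored as a plain list instead of a mapping matrix, and the inputs are assumed to be complement coded.

```python
import numpy as np

class DualVigilanceFuzzyART:
    """Hypothetical minimal DVFA: many categories may map to one cluster."""

    def __init__(self, rho_lb, rho_ub, alpha=1e-3, beta=1.0):
        assert rho_lb <= rho_ub
        self.rho_lb, self.rho_ub = rho_lb, rho_ub
        self.alpha, self.beta = alpha, beta
        self.w = []        # category weight vectors
        self.cluster = []  # cluster label of each category (many-to-one map)

    def learn(self, x):
        """Present one coded sample and return its cluster label."""
        x = np.asarray(x, dtype=float)
        T = [np.minimum(x, w).sum() / (self.alpha + w.sum()) for w in self.w]
        for j in np.argsort(T)[::-1]:  # highest activation first
            m = np.minimum(x, self.w[j]).sum() / x.sum()
            if m >= self.rho_ub:
                # Upper-bound test passed: ordinary fuzzy ART learning.
                self.w[j] = ((1 - self.beta) * self.w[j]
                             + self.beta * np.minimum(x, self.w[j]))
                return self.cluster[j]
            if m >= self.rho_lb:
                # Only the looser test passed: new category, same cluster.
                self.w.append(x.copy())
                self.cluster.append(self.cluster[j])
                return self.cluster[j]
        # Both tests failed for every category: new category and new cluster.
        self.w.append(x.copy())
        self.cluster.append(max(self.cluster, default=-1) + 1)
        return self.cluster[-1]

# A chain of 1-D points (complement coded as [v, 1 - v]) plus one distant point.
net = DualVigilanceFuzzyART(rho_lb=0.85, rho_ub=0.95)
labels = [net.learn([v, 1.0 - v]) for v in (0.1, 0.2, 0.3, 0.4, 0.9)]
```

In this example each chained point is too far from its neighbor to be absorbed at `rho_ub` but close enough at `rho_lb`, so the elongated group is covered by four categories that all map to one cluster.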

3 Distributed dual vigilance fuzzy ART

The distributed dual vigilance fuzzy ART (DDVFA) neural network architecture described in Section 3.1 can be viewed as an “ART of ARTs”, in which each node in the category representation field $F_{2}$ of a global ART is itself a local ART, where the latter represents a given data cluster. Equivalently, it can be seen as an unsupervised modular neural network consisting of local ARTs whose multi-module decision making system is a global ART. Since ART-based systems are sensitive to the order of input presentation, Section 3.2 presents an approach to compensate for this dependency: the output of a DDVFA module (layer 1) is cascaded into a compatible Merge ART module (layer 2).

3.1 DDVFA architecture

Table 1 lists the notation used in this section, and Fig. 2 depicts a generic DDVFA. It is a modular structure in which a global ART controls local parallel ARTs via a vigilance feedback between these modules – cf. ART tree [32, 33], in which $F_{2}$ nodes are also ART modules, but these are not controlled by a global ART module. The global ART acts as a mapping mechanism analogous to the inter-ART module in fuzzy ARTMAP architectures [70, 91], thus maintaining hierarchical consistency. This relates to self-consistent modular ART [6]; however, DDVFA uses a bottom-up agglomerative approach, whereas the former uses a top-down divisive approach limited to hyperrectangular cluster representations. Concretely, DDVFA is a multi-prototype hierarchical agglomerative clustering (HAC) method that builds a self-consistent two-level hierarchy of categories.

Similar to DVFA, the vigilance parameters of the global and local ARTs are denoted as $\rho_{LB}$ and $\rho_{UB}$, respectively, where the constraint $\rho_{LB}\leq\rho_{UB}$ is enforced. Setting $\rho_{UB}=\rho_{LB}$ reduces the DDVFA to a generic fuzzy ART framework, which ensures that each global ART’s $F_{2}$ node (i.e., each local ART) encodes one category. Alternatively, setting $\rho_{UB}$ strictly greater than $\rho_{LB}$ builds a multiple category representation for each cluster, thus enabling an approximation of that cluster’s geometry over the data space according to the underlying assumption of the activation and match functions, which are to be set a priori. The vigilance parameters $\rho_{LB}$ and $\rho_{UB}$ reflect the minimum similarity of a cluster and the granularity level of the data quantization (i.e., the categories’ sizes), respectively. In other words, the rationale is to restrict the maximum internal category size of each local ART while maintaining a smaller similarity constraint for the cluster represented by each global ART $F_{2}$ node. Thus, local ART modules (or clusters) can be added as needed.

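The inner workings of DDVFA are the same as those of a generic ART architecture, as reviewed in Section 2. However, the activation $T^{ART_{i}}(\cdot)$ and match $M^{ART_{i}}(\cdot)$ functions of the global ART’s $F_{2}$ node $i$ are a distributed version of the local $ART_{i}$ categories’ activation $T^{ART_{i}}_{j}$ and match $M^{ART_{i}}_{j}$ functions based on HAC, where $j\in\{1,...,k\}$ indexes the categories. Specifically, the activation and match functions of global ART’s $F_{2}$ node $i$ in layer (1) are given by a function of local $ART^{(1)}_{i}$’s $k$ nodes:

$$T^{ART^{(1)}_{i}}=f\left(T^{ART^{(1)}_{i}}_{1},T^{ART^{(1)}_{i}}_{2},...,T^{ART^{(1)}_{i}}_{k}\right),\tag{7}$$

where

$$T^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{x}\wedge\bm{w}^{ART^{(1)}_{i}}_{j}|}{\alpha+|\bm{w}^{ART^{(1)}_{i}}_{j}|}\right)^{\gamma},\quad j\in\{1,...,k\},\tag{8}$$

and

$$M^{ART^{(1)}_{i}}=g\left(M^{ART^{(1)}_{i}}_{1},M^{ART^{(1)}_{i}}_{2},...,M^{ART^{(1)}_{i}}_{k}\right),\tag{9}$$

where

$$M^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{x}\wedge\bm{w}^{ART^{(1)}_{i}}_{j}|}{|\bm{x}|}\right)^{\gamma},\quad j\in\{1,...,k\}.\tag{10}$$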

In this study, for simplicity, $f\left(\cdot\right)=g\left(\cdot\right)$ in Eqs. (7) and (9), i.e., the same functional relationship is used for the activation and match functions. These are listed in Table 2 and are based on HAC methods [2]. A power parameter $\gamma\geq 1$ is employed here in both the activation and match functions. Like the power parameter used in [64, 65], $\gamma$ assumes the role of a kernel width, facilitates the selection of the dual vigilance parameters, and reduces category proliferation (Section 5.4). Setting $\gamma=1$ corresponds to a standard fuzzy ART module, in which a moderately far sample would still have a reasonably large value for the match function.

Increasing $\gamma$ extends the region of successful dual vigilance parameter combinations because the match and activation functions (when $\gamma=1$) decay linearly and slowly for samples outside a category’s hyperrectangular boundaries, whereas larger values of $\gamma$ create steeper decays (Fig. 3). A similar behavior is exhibited by fuzzy min-max neural networks [92, 93, 94]; the variant [94] addresses it by devising a custom fuzzy membership function, whose sensitivity parameter performs the same role of controlling the decay of the membership value. Furthermore, the higher order membership class of functions has been shown to enhance fuzzy ART performance [95].

The property exploited here is the fact that the activation and match functions become more “selective” (as expected from a power rule as a contrast-enhancement procedure [64, 65]); e.g., in Fig. 3 their trapezoidal form approaches a rectangular membership function. Therefore, regarding the match function, increasing $\gamma$ makes far samples less similar and a category’s vigilance region [96] smaller (Fig. 3). Naturally, when applying a power rule to a scalar in the range $[0,1]$, such as the case of the match and activation functions, its value decreases with $\gamma$. Therefore, to account for the scaling effect, instead of using Eq. (10), the match function is normalized in practice as:

$$M^{ART^{(1)}_{i}}_{j}=\left(\frac{|\bm{w}^{ART^{(1)}_{i}}_{j}|}{|\bm{x}|}\right)^{\gamma^{*}}T^{ART^{(1)}_{i}}_{j},\quad j\in\{1,...,k\},\tag{11}$$

where $0\leq\gamma^{*}\leq\gamma$ is the reference kernel width with respect to which the match function is normalized (see Appendix A). In this paper’s experiments, such normalization was performed with respect to the match function values of a standard fuzzy ART (i.e., $\gamma^{*}=1$). Note that the higher order HAC-based activation functions in Eq. (8) do not change the search order for global ART when varying $\gamma$ for the single, complete, and centroid methods, but they may for the weighted and average methods. Varying $\gamma$ also does not affect the search order within a local fuzzy ART module that uses the higher order activation and match functions.
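The effect of the kernel width and of the normalization can be checked numerically. The sketch below is illustrative only: the function names, the example category, and the parameter values are assumptions, and complement-coded inputs are assumed.

```python
import numpy as np

def activation(x, w, alpha=1e-3, gamma=1.0):
    """Higher order activation: (|x ^ w| / (alpha + |w|)) ** gamma."""
    return (np.minimum(x, w).sum() / (alpha + w.sum())) ** gamma

def match(x, w, gamma=1.0):
    """Higher order (unnormalized) match: (|x ^ w| / |x|) ** gamma."""
    return (np.minimum(x, w).sum() / x.sum()) ** gamma

def normalized_match(x, w, alpha=1e-3, gamma=1.0, gamma_ref=1.0):
    """Normalized match: (|w| / |x|) ** gamma_ref times the activation."""
    return (w.sum() / x.sum()) ** gamma_ref * activation(x, w, alpha, gamma)

w = np.array([0.4, 0.4, 0.4, 0.4])        # hyperbox [0.4, 0.6]^2
inside = np.array([0.5, 0.5, 0.5, 0.5])   # the point (0.5, 0.5)
outside = np.array([0.9, 0.5, 0.1, 0.5])  # the point (0.9, 0.5)

# Unnormalized match shrinks with gamma even for a sample inside the box.
raw_inside = [match(inside, w, gamma=g) for g in (1.0, 3.0)]
# Normalized match keeps the inside value near its gamma = 1 level ...
norm_inside = [normalized_match(inside, w, gamma=g) for g in (1.0, 3.0)]
# ... while still decaying faster for a sample outside the box.
norm_outside = [normalized_match(outside, w, gamma=g) for g in (1.0, 3.0)]
# With alpha = 0 and gamma = gamma_ref = 1, the standard match is recovered.
recovered = normalized_match(outside, w, alpha=0.0, gamma=1.0, gamma_ref=1.0)
```

Because the inside value stays close to the standard fuzzy ART match, vigilance values chosen for $\gamma=1$ keep a comparable meaning when $\gamma$ is increased, which is the purpose of normalizing with respect to $\gamma^{*}$.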

Remark 1. A power law was introduced in distributed ART/ARTMAP [64] for the increased gradient content-addressable memory rule as a contrast enhancement procedure, and it has been used in other ART variants such as distributed ARTMAP [65] and default ARTMAPs [69, 71]. As opposed to the latter ART systems, where the activation functions are normalized to $1$ with respect to a subset of highly active nodes, DDVFA normalizes its match functions rather than its activation functions. Specifically, the match functions are normalized using a reference parameter $\gamma^{*}$ and with respect to an individual category; additionally, DDVFA’s match-reset-search mechanism itself is distinct and uses winner-takes-all learning, as opposed to distributed ART’s distributed learning.

Remark 2. There are subtle, yet fundamental, differences between DVFA and DDVFA besides the architecture itself and the distributed HAC-based higher order nature of the activation and normalized match functions. The first one relates to the search mechanism. In DVFA, it is theoretically possible for categories mapped to the same cluster to be brought up during the search process. Conversely, in DDVFA, if a global ART node does not satisfy the vigilance test, then its local ART and the cluster it represents (which includes all its categories) are shut down and will not appear again during global ART’s search. Another difference is that, according to Eq. (9) and Table 2, the match functions are distributed, and, in the case of the single and complete variants, the category selected by winner-takes-all competition and the category subjected to the vigilance test are not required to be the same.

Naturally, DDVFA integrates a winner-takes-all mechanism to select among global ART’s $F_{2}$ nodes (i.e., local fuzzy ARTs) with a variety of distributed HAC-based activation/match functions, which are computed using the local fuzzy ARTs’ weight vectors. According to their definitions (Table 2), they range from winner-takes-all (single) and loser-takes-all (complete) to completely distributed (average, centroid, and weighted). DDVFA can be viewed as an ART-based online incremental approximate (prototype-based) HAC method. If $\rho_{UB}^{(1)}=1$, then the approach reduces to an ART-based HAC, since each local fuzzy ART’s category encodes a single sample, and the dendrogram cut-level is defined by the global ART module’s vigilance parameter $\rho_{LB}^{(1)}$. Algorithm 1 summarizes the DDVFA’s pseudocode.
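Since Algorithm 1 and Table 2 are not reproduced in this section, the following is only a sketch of one plausible reading of the procedure. The class names are hypothetical. The single, complete, and average functions are assumed to be the maximum, minimum, and mean over a local module's categories; the centroid and weighted variants are omitted. The local modules are assumed to test the normalized match against $\rho_{UB}$, and inputs are assumed to be complement coded.

```python
import numpy as np

# Assumed HAC-style aggregations over a local module's categories.
LINKAGE = {"single": np.max, "complete": np.min, "average": np.mean}

class LocalFuzzyART:
    """Hypothetical local module: one cluster made of several categories."""

    def __init__(self, x, rho_ub, alpha, beta, gamma, gamma_ref):
        self.rho_ub, self.alpha, self.beta = rho_ub, alpha, beta
        self.gamma, self.gamma_ref = gamma, gamma_ref
        self.W = [x.copy()]

    def activations(self, x):  # higher order activation per category
        return np.array([(np.minimum(x, w).sum() / (self.alpha + w.sum()))
                         ** self.gamma for w in self.W])

    def matches(self, x):  # normalized higher order match per category
        T = self.activations(x)
        return np.array([(w.sum() / x.sum()) ** self.gamma_ref * t
                         for w, t in zip(self.W, T)])

    def learn(self, x):
        T, M = self.activations(x), self.matches(x)
        for j in np.argsort(T)[::-1]:
            if M[j] >= self.rho_ub:
                self.W[j] = ((1 - self.beta) * self.W[j]
                             + self.beta * np.minimum(x, self.W[j]))
                return
        self.W.append(x.copy())  # new category inside the same cluster

class DDVFA:
    """Hypothetical global ART whose F2 nodes are LocalFuzzyART modules."""

    def __init__(self, rho_lb, rho_ub, alpha=1e-3, beta=1.0,
                 gamma=3.0, gamma_ref=1.0, method="single"):
        assert rho_lb <= rho_ub
        self.rho_lb = rho_lb
        self.local_args = (rho_ub, alpha, beta, gamma, gamma_ref)
        self.link = LINKAGE[method]
        self.modules = []

    def learn(self, x):
        """Present one coded sample and return its cluster (module) index."""
        x = np.asarray(x, dtype=float)
        T = [self.link(m.activations(x)) for m in self.modules]
        for i in np.argsort(T)[::-1]:
            if self.link(self.modules[i].matches(x)) >= self.rho_lb:
                self.modules[i].learn(x)  # global resonance: local learning
                return int(i)
        self.modules.append(LocalFuzzyART(x, *self.local_args))
        return len(self.modules) - 1

# A chain of 1-D points (coded as [v, 1 - v]) plus one distant point.
points = (0.1, 0.2, 0.3, 0.4, 0.9)
single = DDVFA(rho_lb=0.6, rho_ub=0.95, method="single")
labels_single = [single.learn([v, 1.0 - v]) for v in points]
complete = DDVFA(rho_lb=0.6, rho_ub=0.95, method="complete")
labels_complete = [complete.learn([v, 1.0 - v]) for v in points]
```

With the single variant the chain is followed link by link and ends up in one cluster, whereas the complete variant requires similarity to every category of a module and therefore splits the chain, mirroring the behavior of the corresponding HAC linkages.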

3.2 Merge ART module

The order of input presentation is a challenge for incremental learners as it plays a significant role in such systems’ performance (see references in [79]). For this reason, a Merge ART module (Fig. 4) is introduced here to be placed at layer 2, i.e., on top of the DDVFA in a cascade design. It acts as another ART module with dual vigilance parameters in which the inputs are ART nodes from DDVFA. It has its own set of parameters that are independent of DDVFA. However, for simplicity, the functional forms of DDVFA’s activation and match functions were kept to maintain the same underlying cluster assumptions, and $(\rho^{(2)}_{LB},\rho^{(2)}_{UB})$ were set to $(\rho^{(1)}_{LB},\rho^{(1)}_{UB})$.

The merging process consists of unions or concatenation of local fuzzy ARTs followed by compressions within each set of local fuzzy ARTs. Let $\bm{T}_{k,l}=[t_{ij}]_{R\times C}$ and $\bm{M}_{k,l}=[m_{ij}]_{R\times C}$ be the activation and match matrices of Merge ART’s $F_{2}$ node $ART^{(2)}_{k}$ when the input $ART^{(1)}_{l}$ (from DDVFA) is presented, where $R$ and $C$ are the number of categories of Merge ART’s $ART^{(2)}_{k}$ and DDVFA’s $ART^{(1)}_{l}$, respectively. The entries of matrices $\bm{T}_{k,l}$ and $\bm{M}_{k,l}$ are computed as:

$$t_{i,j}=\left(\frac{|\bm{w}^{ART^{(1)}_{l}}_{j}\wedge\bm{w}^{ART^{(2)}_{k}}_{i}|}{\alpha+|\bm{w}^{ART^{(2)}_{k}}_{i}|}\right)^{\gamma},\tag{12}$$

$$m_{i,j}=\left(\frac{|\bm{w}^{ART^{(2)}_{k}}_{i}|}{|\bm{w}^{ART^{(1)}_{l}}_{j}|}\right)^{\gamma^{*}}t_{i,j}.\tag{13}$$

The activation and match functions of the Merge ART module are listed in Table 3. When resonance is triggered, i.e., when the condition $M^{ART^{(2)}_{K}}\geq\rho_{LB}^{(2)}$ is satisfied, then $ART^{(2)}_{K}(new)\leftarrow ART^{(2)}_{K}(old)\cup ART^{(1)}_{l}$. Finally, to compress the representation, i.e., to reduce the number of categories, in the last step of the Merge ART procedure, the category weight vectors $\bm{w}^{ART^{(2)}_{k}}$ and instance countings $n^{ART^{(2)}_{k}}$ of each local ART module are fed to a fuzzy ART with higher order activation and match functions, using the parameters $\rho=\rho_{UB}^{(2)}$, $\gamma^{*}=1$, and $\gamma$; in this case, when a category learns using Eq. (4), its instance counting is updated as $n^{new}=n^{old}+n^{w}$, where $n^{w}$ is the instance counting of the category presented as an input.
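The pairwise activation and match matrices between a Merge ART node and an incoming DDVFA module can be computed in vectorized form, as sketched below. The function names are hypothetical. Because Table 3 is not reproduced in this section, the single variant (the maximum entry of the match matrix) is assumed for the resonance test, and the compression step is left out.

```python
import numpy as np

def merge_matrices(W2, W1, alpha=1e-3, gamma=3.0, gamma_ref=1.0):
    """Return (T, M), each of shape (R, C).

    W2: (R, d) category weights of the Merge ART node ART^(2)_k.
    W1: (C, d) category weights of the incoming DDVFA module ART^(1)_l.
    """
    W2, W1 = np.asarray(W2, dtype=float), np.asarray(W1, dtype=float)
    # inter[i, j] = |w_j^(1) ^ w_i^(2)| via broadcasting over all pairs.
    inter = np.minimum(W2[:, None, :], W1[None, :, :]).sum(axis=2)
    norm2 = W2.sum(axis=1)[:, None]  # |w_i^(2)| as a column
    norm1 = W1.sum(axis=1)[None, :]  # |w_j^(1)| as a row
    T = (inter / (alpha + norm2)) ** gamma   # entries t_ij
    M = (norm2 / norm1) ** gamma_ref * T     # entries m_ij
    return T, M

def try_merge(W2, W1, rho_lb, **kwargs):
    """Union the two category sets if the assumed single-linkage match resonates."""
    _, M = merge_matrices(W2, W1, **kwargs)
    if M.max() >= rho_lb:
        return np.vstack([W2, W1]), True
    return np.asarray(W2, dtype=float), False

# Two modules that each captured half of the same elongated cluster.
W2 = [[0.1, 0.9], [0.2, 0.8]]
W1 = [[0.3, 0.7], [0.4, 0.6]]
T, M = merge_matrices(W2, W1)
merged, did_merge = try_merge(W2, W1, rho_lb=0.6)
# A distant module does not resonate and is left separate.
_, far_merge = try_merge(W2, [[0.9, 0.1]], rho_lb=0.6)
```

After a union such as `merged`, the compression step described above would pass these weight vectors and their instance countings through a fuzzy ART with vigilance $\rho_{UB}^{(2)}$ to reduce the number of categories.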

The Merge ART module can be triggered at any stage during incremental learning. For convenience, in this study it is activated by the end of one epoch (a full pass through the data, similar to [83]), i.e., after $N$ samples are presented to the learning system, where $N$ is made equal to the data cardinality. Therefore, this framework may perform online incremental approximate HAC without computing a distance matrix with the entire data or requiring full recomputations when new samples are presented. Again, as the vigilance parameter $\rho_{UB}$ approaches 1, there is little to no data compression. Merge ART relates to traditional HAC approaches using ART’s activation function as the similarity measure and the match function as the dendrogram threshold level, i.e., the activation and match functions of the Merge ART module perform an ART-based HAC using the ART weights created by DDVFA. Algorithm 2 summarizes the Merge ART module’s pseudocode.
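The union step described above can be illustrated with a minimal sketch. It is not the authors' Algorithm 2: it assumes the single-linkage variant (the maximum over the entries of the activation and match matrices), it assumes that each entry is obtained by presenting a category weight of $ART^{(1)}_{l}$ to a category of $ART^{(2)}_{k}$ through the higher-order functions given in Appendix A, and it omits the final compression step. The names `merge_modules`, `activation` and `match` are hypothetical.

```python
def fuzzy_and_norm(x, w):
    """L1 norm of the component-wise minimum |x ^ w|."""
    return sum(min(a, b) for a, b in zip(x, w))


def activation(x, w, alpha=1e-3, gamma=3.0):
    """Higher-order activation T = (|x ^ w| / (alpha + |w|)) ** gamma."""
    return (fuzzy_and_norm(x, w) / (alpha + sum(w))) ** gamma


def match(x, w, alpha=1e-3, gamma=3.0, gamma_star=1.0):
    """Normalized match M = (|w| / |x|) ** gamma_star * T (see Appendix A)."""
    return (sum(w) / sum(x)) ** gamma_star * activation(x, w, alpha, gamma)


def merge_modules(modules, rho_lb, alpha=1e-3, gamma=3.0, gamma_star=1.0):
    """Union step only. `modules` is a list of local ART modules, each a
    list of complement-coded category weight vectors."""
    merged = []
    for incoming in modules:
        candidates = []
        for k, node in enumerate(merged):
            # Assumed single-linkage reduction: max over all matrix entries.
            t = max(activation(x, w, alpha, gamma)
                    for w in node for x in incoming)
            m = max(match(x, w, alpha, gamma, gamma_star)
                    for w in node for x in incoming)
            candidates.append((t, m, k))
        # Visit layer-2 nodes by decreasing activation; test the vigilance.
        for t, m, k in sorted(candidates, reverse=True):
            if m >= rho_lb:
                merged[k] = merged[k] + incoming  # union of category sets
                break
        else:
            merged.append(list(incoming))  # no resonance: new layer-2 node
    return merged
```

With three single-category modules located near 0.1, 0.12 and 0.9 on a one-dimensional feature and `rho_lb=0.5`, the first two are concatenated and the third stays separate.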

Remark 3. Merging strategies are commonly employed in ART-based systems. The Merge ART module presented here is closely related to the ART category merging methods discussed in [80, 59, 60, 61, 82, 83, 31] and especially the frameworks in [83, 31]. In the latter, fuzzy ART weights are merged via a fuzzy ART module with its own set of parameters. Although both the DDVFA + Merge ART and the strategy in [83, 31] use a fuzzy ART framework for merging, they have the following fundamental differences: (a) Merge ART’s inputs are local fuzzy ART modules from DDVFA (i.e., subsets of categories) to be merged using a fuzzy ART framework augmented with HAC-based distributed higher order activation and match functions; (b) the output of the merging procedure includes not only categories but also ART modules; (c) Merge ART’s compression step does not use an activation threshold (as in [83]), but instead it uses higher order activation/match functions (in contrast to [83, 31]); (d) the weight update is not based on an overlap/gap between weights (as in [83]), but instead it follows standard fuzzy ART rules (Eq. (4)) which correspond to the weight merging in [31] (and [82] in fast learning mode); and (e) the vigilance parameter used to cluster samples is also used to merge weights during the compression step (in contrast to [83]).

The Merge ART module was designed such that its output can be used to replace DDVFA when the merging procedure is done. The fact that $\rho^{(2)}_{LB}$ used to concatenate DDVFA’s local FAs is smaller than $\rho^{(1)}_{UB}$ used to cluster the samples, ( $\rho^{(2)}_{LB}=\rho^{(1)}_{LB}\leq\rho^{(1)}_{UB}=\rho^{(2)}_{UB}$ ), conforms with the findings reported in [83] that this setting yields a good performance for merging fuzzy ART weights. This is expected, since the overall architecture (DDVFA + Merge ART) is multi-layered and related to ART-based serial structures (e.g., [22, 25]), which in turn typically follow similar parameterization.

4 Experimental Setup

4.1 Data sets

A mix of $30$ real world and artificial benchmark data sets comprising diverse characteristics were used in the experiments. They are available at the UCI Machine Learning Repository [97], Fundamental Clustering Problem Suite [98], Clustering data sets [99], and Data package [100]. Fig. 5 illustrates these data sets, and Table 4 summarizes their characteristics. Linear normalization was applied to all data sets to scale their features to the range $[0,1]$ , as well as complement coding, which is a useful data representation technique to mitigate category proliferation in fuzzy ART.
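As an illustration of this preprocessing, the following sketch applies per-feature linear (min-max) normalization followed by complement coding. The function name `complement_code` and the handling of constant features (mapped to 0) are assumptions made for this example.

```python
def complement_code(rows):
    """Scale each feature to [0, 1], then append the complement 1 - x.

    `rows` is a list of equal-length numeric feature vectors. The result
    has 2*d columns, and every coded row sums to d.
    """
    columns = list(zip(*rows))
    lo = [min(c) for c in columns]
    hi = [max(c) for c in columns]
    coded = []
    for row in rows:
        # Constant features carry no information; map them to 0 (assumption).
        scaled = [(v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)]
        coded.append(scaled + [1.0 - s for s in scaled])
    return coded
```

Because every coded vector has the same L1 norm $d$, the norm of an input carries no information by itself, which is what limits category proliferation in fuzzy ART.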

4.2 Clustering algorithms and parameter tuning

To set the parameters of the clustering algorithms employed in the experiments, grid searches were performed through their parameter spaces. For all algorithms, the best solution was selected according to the parameter combination that yielded the peak average performance.

4.2.1 ART-based clustering methods

Fuzzy ART, fuzzy topoART, and DVFA were compared to DDVFA. In the experiments performed, fuzzy ART’s, DVFA’s and DDVFA’s vigilance parameters were scanned in the range $[0,1]$ with identical step sizes equal to $0.01$ (DVFA’s and DDVFA’s vigilances were also subjected to the constraint $\rho^{UB}\geq\rho^{LB}$ ). For all fuzzy ART modules, the maximum number of epochs was set to $1$ (online mode), the choice parameter ( $\alpha$ ) was set to $0.001$ , and the learning rate ( $\beta$ ) was set to 1 (fast learning). DDVFA’s parameters $\gamma^{*}$ and $\gamma$ were set to $1$ and $3$ , respectively; and, for simplicity, $\rho^{(1)}_{UB}=\rho^{(2)}_{UB}$ and $\rho^{(1)}_{LB}=\rho^{(2)}_{LB}$ . Moreover, in all the fuzzy ART implementations, no uncommitted category participated in the winner-take-all competitive process. If none of the current committed categories satisfy the vigilance criteria, then a new one is created and set to the current sample (fast commit). Regarding topoART, the parameters $\rho_{a}$ , $\beta_{sbm}$ , $\phi$ and $\tau$ were scanned in the ranges $[0,1]$ with a step size of $0.008$ , $[0,0.75]$ with a step size of $0.25$ , $[1,4]$ with a step size of $1$ , and $[10\%,30\%]$ of the data cardinality with a step size of $10\%$ , respectively. These ranges and step sizes generated approximately the same number of parameter combinations for topoART, DVFA, and DDVFA. Module B’s clusters were taken as topoART’s output. Finally, for all these methods, $30$ runs were performed for each data set in both random and VAT ordered presentation scenarios.
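The fuzzy ART settings above (single epoch, fast learning, fast commit, and no uncommitted category in the competition) can be sketched as follows. This is a plain fuzzy ART pass written for illustration only, not the released MATLAB code; the function name `fuzzy_art_online` is hypothetical, and the inputs are assumed to be complement coded.

```python
def fuzzy_art_online(samples, rho, alpha=0.001, beta=1.0):
    """Single-pass fuzzy ART over complement-coded samples.

    Returns the category weights and the category index of each sample.
    """
    weights, labels = [], []
    for x in samples:
        overlap = [sum(min(a, b) for a, b in zip(x, w)) for w in weights]
        # Winner-take-all search among committed categories only,
        # in order of decreasing activation |x ^ w| / (alpha + |w|).
        order = sorted(range(len(weights)),
                       key=lambda j: overlap[j] / (alpha + sum(weights[j])),
                       reverse=True)
        for j in order:
            if overlap[j] / sum(x) >= rho:  # vigilance test |x ^ w| / |x|
                weights[j] = [beta * min(a, b) + (1.0 - beta) * b
                              for a, b in zip(x, weights[j])]
                labels.append(j)
                break
        else:
            weights.append(list(x))  # fast commit: new category equals x
            labels.append(len(weights) - 1)
    return weights, labels
```

With `beta=1.0` the update reduces to `w = min(x, w)` component-wise, so each category can only grow its hyperrectangle.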

4.2.2 Non-ART-based clustering methods

DBSCAN [111], affinity propagation (AP) [112], k-means [113], and single linkage (SL-HAC) [2] were compared to DDVFA. In the experiments performed, DBSCAN’s $MinPts$ parameter was varied in the range $[1,4]$ with a step size of $1$ , while $eps$ was scanned in the range $[0,\sqrt{d}]$ with a step size of $0.005$ , where $d$ is the dimensionality of the data (thus encompassing the full range of possible distance values in the $d$ -dimensional unit cube). The number of clusters $k$ in k-means was varied in the range $\left[1,\left\lceil\sqrt{N}\right\rceil\right]$ , where $N$ is the cardinality of the data set (this upper bound is usually taken as a rule of thumb [114, 115]). Additionally, k-means was repeated $10$ times, and the best solution, according to the cost function being minimized, was selected for each value of $k$ . The AP’s damping factor $\lambda$ was varied in the range $[0.5,1]$ with a step size of $0.005$ , and the preference parameter was set as the median of the data samples’ similarities. SL-HAC used Euclidean distance, and its dendrogram was cut at all merging levels. Finally, for all these methods, a single run was performed for each randomized data set, since they are global approaches that are not, or only marginally, order dependent.
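The search grids for k-means and DBSCAN described above can be generated as in the following sketch. The helper names are hypothetical, and the small tolerance added before truncation is an implementation choice to avoid losing the last grid point to floating-point rounding.

```python
import math


def kmeans_k_values(n_samples):
    """Candidate numbers of clusters: 1, ..., ceil(sqrt(N))."""
    upper = math.isqrt(n_samples - 1) + 1  # equals ceil(sqrt(N)) for N >= 1
    return list(range(1, upper + 1))


def dbscan_eps_values(n_features, step=0.005):
    """Candidate eps values: 0, step, 2*step, ..., up to sqrt(d)."""
    n_steps = int(math.sqrt(n_features) / step + 1e-9)
    return [i * step for i in range(n_steps + 1)]


MIN_PTS_VALUES = [1, 2, 3, 4]  # DBSCAN MinPts grid stated in the text
```

For instance, a data set with $N=150$ samples and $d=4$ features gives 13 candidate values of $k$ and 401 candidate values of $eps$.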

4.3 Clustering performance assessment

The adjusted Rand index ( $AR$ ) [116] is an external cluster validity index commonly used in the unsupervised learning literature to measure the level of agreement between a data set’s reference partition (i.e., ground truth structure) and a discovered partition [2]. It was used in this work to evaluate the quality of the solutions returned by all clustering algorithms. The $AR$ is defined as:

[TABLE]

where $tp$ , $tn$ , $fp$ and $fn$ stand for true positive, true negative, false positive, and false negative, respectively.
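As the equation itself is not reproduced in this text, the sketch below assumes the usual pair-counting form of the adjusted Rand index, $AR=\frac{\binom{N}{2}(tp+tn)-[(tp+fp)(tp+fn)+(fn+tn)(fp+tn)]}{\binom{N}{2}^{2}-[(tp+fp)(tp+fn)+(fn+tn)(fp+tn)]}$ , where a pair of samples is a "positive" when both samples share a cluster. The function names are hypothetical, and the brute-force pair enumeration is quadratic in $N$ and only intended for small examples.

```python
from itertools import combinations
from math import comb


def pair_counts(y_true, y_pred):
    """Count sample pairs by agreement between the two partitions."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1  # grouped in the discovered partition only
        else:
            fn += 1  # grouped in the reference partition only
    return tp, tn, fp, fn


def adjusted_rand(y_true, y_pred):
    """Adjusted Rand index from pair counts (assumed pair-counting form)."""
    tp, tn, fp, fn = pair_counts(y_true, y_pred)
    m = comb(len(y_true), 2)
    expected = (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)
    denom = m * m - expected
    if denom == 0:
        return 1.0  # both partitions are trivial and identical
    return (m * (tp + tn) - expected) / denom
```

The index equals 1 for identical partitions regardless of label names, and it can be negative when the agreement is worse than expected by chance.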

4.4 Statistical analysis methodology

The clustering algorithms were compared following the procedures discussed in [117]:

1. The quantities of interest (i.e., performance in terms of AR and network compactness) were tested for equality using Iman-Davenport’s correction [118] of Friedman’s non-parametric rank sum test [119, 120].

2. If there was sufficient evidence to reject the null hypothesis, then a critical difference (CD) diagram [117] was generated using Nemenyi’s post-hoc test [121].
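The two steps above can be sketched as follows, assuming the standard formulas for the Friedman statistic, the Iman-Davenport correction, and the Nemenyi critical difference. The function names are hypothetical, and the critical values in `Q_ALPHA_005` are the commonly tabulated two-tailed Nemenyi values at the $0.05$ level, which should be checked against [117] before use. The p-value (from an F distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom) is left out to keep the sketch dependency-free.

```python
import math

# Commonly tabulated Nemenyi critical values at alpha = 0.05 (assumption).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def rank_row(scores):
    """Ranks within one data set: rank 1 is the best (highest) score,
    and tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def average_ranks(score_table):
    """score_table[i][j] is the score of algorithm j on data set i."""
    rows = [rank_row(r) for r in score_table]
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def iman_davenport(avg_ranks, n):
    """Return (Friedman chi-square, Iman-Davenport F statistic)."""
    k = len(avg_ranks)
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    denom = n * (k - 1) - chi2
    f_stat = math.inf if denom == 0 else (n - 1) * chi2 / denom
    return chi2, f_stat


def nemenyi_cd(k, n, q_alpha=None):
    """Critical difference between average ranks at the 0.05 level."""
    q = Q_ALPHA_005[k] if q_alpha is None else q_alpha
    return q * math.sqrt(k * (k + 1) / (6 * n))
```

Two algorithms are then regarded as significantly different when their average ranks differ by at least the critical difference; for example, `nemenyi_cd(3, 30)` is approximately 0.605 for three systems compared over 30 data sets.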

4.5 Software and code

The experiments were conducted using MATLAB, scikit-learn [122], Orange [123], and Cluster Validity Analysis Platform [124]. The MATLAB code for fuzzy ART, DVFA, and DDVFA is available at the Applied Computational Intelligence Laboratory group GitHub repositories (https://github.com/ACIL-Group/DVFA and https://github.com/ACIL-Group/DDVFA). The topoART experiments were carried out using LibTopoART [35] (v0.74, available at https://www.libtopoart.eu), whereas the other clustering algorithms’ implementations were from scikit-learn (http://scikit-learn.org/).

5 Results and discussion

5.1 DDVFA results with pre- and post-processing

This study investigates DDVFA’s order of presentation dependency by analyzing two frameworks: an offline approach that consists of pre-ordering the shuffled samples using VAT [85], as per [79], and an online approach in which the samples are solely randomized prior to presentation. The latter is a more realistic scenario when an online incremental learner is required, i.e., a learning system is confronted with a data stream. That is why all the experiments were conducted with one epoch (single pass), so each data sample is only presented once.

Employing the methodology described in subsection 4.2, the experiments were performed with the following three systems: (1) DDVFA, (2) VAT + DDVFA, and (3) DDVFA + Merge ART. The results are summarized in Fig. 6, which depicts radar charts of the peak average performance of all the mentioned systems grouped by the type of HAC-based activation/match functions (i.e., per Tables 2 and 3’s method): (6a) average, (6b) centroid, (6c) complete, (6d) median, (6e) single, and (6f) weighted. It shows that, in general, VAT pre-ordering yields a better performance than pure DDVFA or post-processing with Merge ART. The latter approaches yielded a similar performance across all types of activation/match functions, except for the single-linkage based DDVFA, in which using Merge ART makes a significant difference compared to DDVFA by itself. For instance, Fig. 7 illustrates the outputs of DDVFA before and after cascading it with Merge ART for the Spiral, Face, Atom and Chainlink data sets.

5.1.1 Statistical analysis of performance

Using the Iman-Davenport test, a statistical analysis was conducted to quantitatively assess if the performances of the different types of HAC-based activation/match functions (average vs. centroid vs. complete vs. median vs. single vs. weighted) were equivalent when fixing the type of DDVFA system. All these performance equivalency hypotheses were rejected at a $0.05$ significance level (Table 5). Therefore, Nemenyi’s test was performed, and Fig. 8 depicts the resulting CD diagrams. They indicate that the best performing groups seem to be: (Fig. 8a) {average, single, weighted, median}, (Fig. 8b) {weighted, median}, and (Fig. 8c) {single, weighted}; and the worst performing groups seem to be: (Fig. 8a) {centroid}, (Fig. 8b) {centroid, complete}, and (Fig. 8c) {centroid, complete}, respectively. The fact that the best average rank for DDVFA is achieved by the weighted variant is expected since it considers additional information in the form of local prior probabilities.

A similar statistical analysis was conducted to determine if the performances of the systems (DDVFA vs. VAT + DDVFA vs. DDVFA + Merge ART) were equivalent when fixing the type of activation/match functions. All these null hypotheses were rejected at a $0.05$ significance level (Table 6). Therefore, Nemenyi’s test was performed, and, for clarity, Fig. 9 solely depicts the resulting CD diagrams of selected HAC-based activation/match functions. Typically, pre-processing with VAT and post-processing with the Merge ART module are statistically equivalent, and, as expected, both are statistically better than just feeding the shuffled data directly to DDVFA.

5.1.2 Summary

The statistical analysis suggests that pre-processing with VAT or post-processing with Merge ART yields better results than just DDVFA. Furthermore, in general, single, median, average and weighted HAC-based activation/match functions appear to be statistically equivalent. Thus, the recommended systems are DDVFA + Merge ART for online learning mode and random presentation, and VAT + DDVFA for offline learning mode and applications where pre-ordering is feasible; for both of these systems the single-linkage variant is recommended since it appeared in the top 2 average ranks for both learning modes.


5.2 Performance comparison 1: ART-based clustering algorithms

Table 7 lists the $AR$ peak average performance of fuzzy ART, DVFA, topoART B, and DDVFA for both random and VAT ordered presentation scenarios. Given the results of Section 5.1’s statistical analyses, the VAT + DDVFA and DDVFA + Merge ART systems were selected, and the performance was recorded for the single linkage-based variant of the activation and match functions.

5.2.1 Statistical analysis of performance

The hypothesis that these algorithms perform equally was tested using the Iman-Davenport statistic and rejected at a $0.05$ significance level for both random (p-value=1.1102E-16) and VAT orderings (p-value=3.2012E-07). Therefore, the CD diagrams were further computed, as shown in Fig. 10, using Nemenyi’s test. As shown, VAT pre-processing (offline incremental mode) equalizes performance, such that all multi-prototype ART-based algorithms become statistically similar, while also outperforming fuzzy ART. Alternately, when data is presented randomly in an online incremental mode DDVFA + Merge ART yields a statistically better performance than all the other ART-based algorithms at a $0.05$ significance level. DVFA and topoART B were observed to be statistically equivalent (as expected per [57]) while also surpassing standard fuzzy ART. In the vast majority of the remaining comparisons among TopoART, DVFA, DDVFA systems, and fuzzy ART, no significant statistical difference was observed among the first three, while all of them outperformed fuzzy ART.

5.2.2 Statistical analysis of compactness

The compactness of the multi-prototype ART-based networks was also compared, i.e., the number of categories that were created to represent the data sets’ clusters. The hypothesis of equivalence (using Iman-Davenport’s test) was rejected at a $0.05$ significance level, with p-values equal to (a) 5.2039E-03 for VAT pre-ordering and (b) 1.7622E-02 for random presentation. Given this outcome, the corresponding CD diagrams were generated as shown in Fig. 11 using Nemenyi’s test. In online learning mode, in which samples are presented randomly, topoART has the best average ranking for compactness. Yet, in offline learning mode, in which order-dependence can be managed via pre-processing strategies, DDVFA + Merge ART has a better average compactness ranking than topoART. However, their observed compactness was similar, with no statistically significant difference. As expected, topoART creates more compact networks than DVFA in all scenarios [57]. Note that improved compactness may be obtained by carefully tuning parameter $\gamma$ .

5.2.3 Summary

The statistical analysis suggests that, when pre-processing with VAT, topoART, DVFA, and DDVFA seem to perform equally, whereas for random presentation DDVFA + Merge ART’s performance was observed to be statistically better than the remaining ART-based systems. Moreover, no statistical differences were found between the compactness of topoART and DDVFA systems using single linkage functions for either randomly or VAT ordered presentations, and both achieved a better average rank than DVFA.

5.3 Performance comparison 2: non-ART-based clustering algorithms

Table 7 also reports the performance of k-means, DBSCAN, affinity propagation (AP), and single linkage (SL-HAC). Again, the Iman-Davenport test was used to compare these algorithms to (a) VAT + DDVFA, and (b) DDVFA + Merge ART. These null hypotheses were rejected at a $0.05$ significance level with p-values equal to (a) 1.4944E-08, and (b) 4.5854E-07. Next, the CD diagrams were generated using Nemenyi’s test, as shown in Fig. 12. It was observed that for these data sets all clustering algorithms seem to be statistically equivalent at a $0.05$ significance level, except for AP. Nevertheless, both DDVFA systems (VAT + DDVFA and DDVFA + Merge ART) have a smaller average rank value (particularly when using the VAT pre-processor). This on-par performance is remarkable, especially regarding the comparison with the DDVFA + Merge ART system, since in this case clustering is performed both incrementally and online, as opposed to the other global clustering methods. Re-performing the computations using the entire data set is not required if a new sample is presented (cf. SL-HAC). Therefore, it is possible to extend the current knowledge base. Moreover, the weights do not cycle, and previously acquired knowledge is not forgotten (cf. k-means). These important advantages of DDVFA are inherited from ART.

5.4 Sensitivity to kernel width parameter

To examine the behavior of the DDVFA systems with respect to parameter $\gamma$ , $\gamma=1$ and $\gamma=3$ were arbitrarily set, and Wilcoxon’s signed-ranks tests [125] were conducted to compare the performance and compactness of the best dual vigilance parameter combination (peak average performances over 30 runs). The results are reported in Table 8.

Regarding the HAC-based activation/match functions, a significant statistical difference for both performance and compactness was observed for (a) the single variant, (b) most systems using centroid and median, and (c) the complete variant but to a lesser extent. Average and weighted variants do not appear to be very affected by changing parameter $\gamma$ between these two values. With respect to the three DDVFA systems, performance and compactness are affected by parameter $\gamma$ , except for the compactness of the VAT + DDVFA system which remains mostly unaffected.

Due to these statistical analysis results, the DDVFA systems’ behavior was further investigated using single-linkage HAC activation/match functions with respect to parameter $\gamma$ . The study is performed by varying $\gamma$ in the interval $[0,5]$ with a step size of $0.5$ and observing the following aspects: peak average performance ( $AR$ ), number of clusters, and number of categories created. The last two quantities were examined since DDVFA belongs to the class of multi-prototype-based clustering methods, i.e., each cluster may be represented by multiple categories. Such behaviors are illustrated in Figs. 13 through 15. For clarity, and according to the recommendations outlined in Section 5.1, only the behavior with respect to the data sets Seeds, Wine, Target, Tetra, Lsun, and Moon is reported.

For each value of $\gamma$ , the vigilance parameter combination corresponding to the best average performance over $10$ different input permutation orders is selected. Following Occam’s razor and the principle of parsimony [126], among all models that yield the best performance, the one with the simplest clustering structure is selected, i.e., the one that requires the smallest number of categories to encode its clustering partition. Thus, the depicted box-plots relate to the simplest model that achieved the peak average performance for each value of $\gamma$ .
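This selection rule can be written as a short sketch. The record layout (keys `mean_ar` and `n_categories`) and the tolerance `tol` used to decide which models tie for the best performance are assumptions made for this example.

```python
def select_parsimonious(results, tol=0.0):
    """Among the parameter combinations whose mean AR is within `tol` of
    the best one, return the combination that uses the fewest categories.

    `results` is a list of dicts with (hypothetical) keys
    'rho', 'mean_ar' and 'n_categories'.
    """
    best = max(r["mean_ar"] for r in results)
    top = [r for r in results if best - r["mean_ar"] <= tol]
    return min(top, key=lambda r: r["n_categories"])
```

For example, if two vigilance combinations both reach a mean $AR$ of 1 with 12 and 7 categories, the one with 7 categories is retained.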

Remark 4. Note that the vigilance parameter combinations that yield each box-plot in Figs. 13 through 15 are not held constant across the different values of $\gamma$ ; therefore, they may not be necessarily the same. For instance, Fig. 13 shows that, for the VAT + DDVFA system, given a value of $\gamma$ , there is a vigilance parameter combination that can find the correct partitions ( $AR=1$ ) with similar compactness levels (number of categories) across $\gamma$ values for the Target, Tetra, Lsun, and Moon data sets. Analogously, given a value $\gamma$ , there is a vigilance parameter combination for the DDVFA + Merge ART system that yields maximum $AR$ for the Target, Lsun, and Moon data sets; however, the number of categories fluctuates when the samples are randomly presented. If the dual vigilance parameter combination is held constant, e.g., by setting it to the best combination associated with $\gamma=1$ , then, for other $\gamma$ values, the behaviors with respect to performance, number of clusters and categories may change for both systems, as shown in Fig. 16 for the Target data set. Note the increase in the number of categories due to the increase of $\gamma$ : the smallest dual vigilance parameter values required to achieve the best performance for $\gamma=1$ are somewhat large, and the same values coupled with a more selective kernel (larger $\gamma$ ) result in more categories being created.

Naturally, the behavior of the DDVFA systems with respect to $\gamma$ is data- and system-dependent. Although some $AR$ performance fluctuation exists across the values of $\gamma$ for some data sets, it generally seems to be fairly robust to this parameter. The number of categories, i.e., the compression level, often drastically changes with $\gamma$ . For example, setting $\gamma=1$ (i.e., using standard fuzzy ART building blocks) versus $\gamma=2$ already yields noticeable changes in many data sets as shown in Figs. 13 through 15, especially for the DDVFA + Merge ART system. Furthermore, the number of categories appears to decrease by increasing $\gamma$ as this tendency was observed in many of the data sets in Figs. 13 through 15. Specifically, Fig. 17 illustrates this effect in the Target data set. These experimental results are consistent with previous findings in related work, in which improved memory compression is achieved when using power rules coupled with distributed learning in ART-systems [64, 65]. Another important aspect refers to the region of the dual vigilance parameter space which correlates with better performance; such a region seems to increase with the value of $\gamma$ for some data sets (e.g., the Target data set in Fig. 18), usually at the expense of the network’s compactness.

6 Conclusion

This paper presented DDVFA, a novel, modular, hierarchically self-consistent ART-based architecture for incremental, unsupervised learning. DDVFA features a number of innovations that differ from other ART-based systems. It relies on dual vigilance parameters to handle data quantization (local scale) and cluster similarity (global scale), features multi-prototype representations, and higher-order distributed activation and match functions. DDVFA consists of a global ART network whose nodes are local ART modules. The learning mechanism of the former is triggered by the feedback from the latter, thus enabling the system to capture arbitrary data distributions when using appropriate activation/match functions. DDVFA enables both one- and multi-category representations of clusters (i.e., one-to-one and one-to-many mappings of categories to clusters) according to the setting of the upper and lower vigilance parameter values.

Like fuzzy ART and DVFA, DDVFA is sensitive to input order presentation. This work also introduces a compatible Merge ART module that yields improved performance in the online mode where samples arrive in a random order and pre-processing cannot be employed. Experiments were conducted with random and VAT ordered samples. As expected, the latter approach yields better average performance ranks, and thus it is recommended in applications where the offline learning mode is available. Otherwise, for online incremental learning, the usage of a Merge ART module cascaded with DDVFA is recommended, given that the latter showed superior performance and less sensitivity to input presentation order. The VAT + DDVFA and DDVFA + Merge ART systems were found to be statistically equivalent in this paper’s experiments. Naturally, the type of distributed activation/match functions used for the similarity definition is data-dependent; the single-linkage-based ones typically yielded the best and second best average performance rank when cascading Merge ART and pre-processing with VAT, respectively. Conversely, weighted-based activation/match functions yielded the best average performance rank when solely using DDVFA. Naturally, as with other ART algorithms, the dual vigilance parameters must be carefully tuned.

The combination of DDVFA + Merge ART significantly outperformed fuzzy ART, DVFA, and topoART in most of the data sets with randomly presented samples, where a statistical difference was observed. Conversely, when pre-processing with VAT, no statistical difference was observed, except for standard fuzzy ART. The compactness (i.e., number of categories created) of the networks generated by the multi-prototype ART-based architectures was also compared, and again, no statistical difference was observed. Furthermore, the clustering performance of these best performing DDVFA systems was compared with single-linkage HAC, DBSCAN, k-means and affinity propagation. The results indicated that these DDVFA systems are statistically equivalent to the first three clustering algorithms mentioned, and they all perform statistically better than affinity propagation. This is noteworthy since DDVFA-based systems are based on incremental learning, whereas all the other non-ART-based algorithms used batch learning.

Finally, this work investigated the effect of the parameter $\gamma$ on the behavior of DDVFA. The performance was robust toward this parameter, and with appropriate selection it can potentially increase the compactness (or equivalently, reduce the model complexity) of the DDVFA systems. This memory compression characteristic is consistent with findings from previous related work (distributed ART and ARTMAP systems), which combines power rules and distributed learning. Moreover, it was observed that $\gamma$ can extend the subspace of dual vigilance parameter combinations that yield effective performance.

Appendix A Derivation of the match function in DDVFA

This section contains the derivation of Eq. (11). Let $M_{\gamma}=M^{ART^{(1)}_{i}}_{j}$ be the match function of category $j$ of $ART^{(1)}_{i}$ using $\gamma$ and $M_{\gamma^{*}}$ the match function of the same category using $\gamma^{*}$ . Then, the normalized version of $M_{\gamma}$ with respect to $M_{\gamma^{*}}$ ( $M_{\gamma}^{n}$ ) is defined as

[TABLE]

The values of $\max(M_{\gamma^{*}})$ and $\max(M_{\gamma})$ are easily obtainable, since any point inside the hyperrectangular category representation would have this value, particularly the weight $\bm{w}=\bm{w}_{j}^{ART^{(1)}_{i}}$ of category $j$ itself. Furthermore, when using complement coding, $|\bm{x}|=d$ is a constant. The values $\min(M_{\gamma^{*}})$ and $\min(M_{\gamma})$ must be located at some corner of the d-dimensional unit hyperbox data space $[0,1]^{d}$ . These values can also be easily calculated for data sets with small dimensionalities. However, as the dimension increases, searching $2^{d}$ points quickly becomes impractical. Therefore, since a match function $M$ satisfies $0\leq M\leq 1$ by definition, a design decision was made to set $\min(M_{\gamma^{*}})=\min(M_{\gamma})=0$ in the normalization procedure. Hence,

[TABLE]

where the constant $c$ is inserted to safeguard against divisions by zero (since $0\leq\rho d\leq|\bm{w}|\leq d$ ). This parameter implies that $\bm{w}=\bm{x}$ no longer yields a match function value equal to $1$ . By making $c$ equal to the choice parameter $\alpha$ , then Eq. (16) becomes

[TABLE]

where $T_{\gamma}=T^{ART^{(1)}_{i}}_{j}$ is the activation function of category $j$ of $ART^{(1)}_{i}$ using $\gamma$ (Eq. (8)). Naturally, if $\gamma^{*}=0$ then $M_{\gamma}^{n}=T_{\gamma}$ , and for $\alpha\ll|\bm{w}|$ , if $\gamma=\gamma^{*}$ then $M_{\gamma}^{n}\approx M_{\gamma^{*}}$ (Eq. (10)).
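Since the displayed equations of this appendix are not reproduced in this text, the chain of steps can be sketched as follows. The sketch assumes that the unnormalized match function has the power-rule form $M_{\gamma}=(|\bm{x}\wedge\bm{w}|/|\bm{x}|)^{\gamma}$ and that the normalization is a linear rescaling of $M_{\gamma}$ onto the range of $M_{\gamma^{*}}$ ; both assumptions are consistent with the limiting cases stated above, but the authors' Eqs. (15) to (17) may be written differently.

```latex
% Assumed unnormalized match function and its maximum
% (attained for any x inside the category, where x ^ w = w):
M_{\gamma} = \left(\frac{|\bm{x}\wedge\bm{w}|}{|\bm{x}|}\right)^{\gamma},
\qquad
\max(M_{\gamma}) = \left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma}.

% Rescaling onto the range of M_{\gamma^{*}} with both minima set to 0:
M_{\gamma}^{n}
  = \frac{M_{\gamma}}{\max(M_{\gamma})}\,\max(M_{\gamma^{*}})
  = \left(\frac{|\bm{x}\wedge\bm{w}|}{|\bm{w}|}\right)^{\gamma}
    \left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}}.

% Guarding the division with a constant c and then setting c = \alpha:
M_{\gamma}^{n}
  = \left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}}
    \left(\frac{|\bm{x}\wedge\bm{w}|}{\alpha+|\bm{w}|}\right)^{\gamma}
  = \left(\frac{|\bm{w}|}{|\bm{x}|}\right)^{\gamma^{*}} T_{\gamma}.
```

Under these assumptions, setting $\gamma^{*}=0$ leaves $T_{\gamma}$ , and setting $\gamma=\gamma^{*}$ with $\alpha\ll|\bm{w}|$ cancels the $|\bm{w}|$ terms and recovers $M_{\gamma^{*}}$ approximately, in agreement with the two special cases noted above.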

Acknowledgment

This research was sponsored by the Missouri University of Science and Technology Mary K. Finley Endowment and Intelligent Systems Center; the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance code BEX 13494/13-9; and the Army Research Laboratory (ARL), and it was accomplished under Cooperative Agreement Number W911NF-18-2-0260. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Bibliography


[1] R. Xu, D. C. Wunsch II, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (3) (2005) 645–678.
[2] R. Xu, D. C. Wunsch II, Clustering, Wiley-IEEE Press, 2009.
[3] R. Xu, D. C. Wunsch II, Clustering algorithms in biomedical research: A review, IEEE Rev. Biomed. Eng. 3 (2010) 120–154.
[4] G. A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine, Comput. Vis. Graph. Image Process. 37 (1) (1987) 54–115.
[5] D. Wunsch, ART properties of interest in engineering applications, in: Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2009, pp. 3380–3383.
[6] G. Bartfai, Hierarchical clustering with ART neural networks, in: Proc. Int. Conf. Neural Netw. (ICNN), Vol. 2, 1994, pp. 940–944.
[7] A.-H. Tan, G. A. Carpenter, S. Grossberg, Intelligence Through Interaction: Towards a Unified Theory for Learning, in: D. Liu, S. Fei, Z.-G. Hou, H. Zhang, C. Sun (Eds.), Advances in Neural Networks – ISNN (Lecture Notes in Computer Science), Vol. 4491, Springer, Berlin, Heidelberg, 2007, pp. 1094–1103.
[8] L. Meng, A. H. Tan, D. Xu, Semi-supervised heterogeneous fusion for multimedia data co-clustering, IEEE Trans. Knowl. Data Eng. 26 (9) (2014) 2293–2306. doi:10.1109/TKDE.2013.47.