Incremental Semantic Mapping with Unsupervised On-line Learning

Ygor C. N. Sousa; Hansenclever F. Bassani

arXiv:1907.04001·cs.RO·July 12, 2019

TL;DR

This paper presents an incremental semantic mapping method using unsupervised online learning with Self-Organizing Maps, enabling robots to build topological maps enriched with semantic data and adapt to new environments without forgetting prior knowledge.

Contribution

It introduces a novel approach combining topological mapping and unsupervised online learning with SOMs for semantic place categorization in robotics.

Findings

01. Successfully demonstrated in real-world experiments

02. Enables continuous learning without degrading previous knowledge

03. Effectively clusters similar places based on semantic information

Abstract

This paper introduces an incremental semantic mapping approach, with on-line unsupervised learning, based on Self-Organizing Maps (SOM) for robotic agents. The method includes a mapping module, which incrementally creates a topological map of the environment, enriched with objects recognized around each topological node, and a module of places categorization, endowed with an incremental unsupervised learning SOM with on-line training. The proposed approach was tested in experiments with real-world data, in which it demonstrates promising capabilities of incremental acquisition of topological maps enriched with semantic information, and for clustering together similar places based on this information. The approach was also able to continue learning from newly visited environments without degrading the information previously learned.

Tables

Table I: Parameter ranges and final configurations.

Parameters	min	max	A	B
OLARFDSSOM
Activation threshold ( $a_{t}$ )	0.8	0.999	0.9879	0.9668
Lowest cluster percentage ( $l_{p}$ )	0.01	0.2	0.1914	0.1414
Relevance rate ( $β$ )	0.001	0.1	0.0163	0.0532
Max competitions ( $maxcomp$ )	5	150	34	89
Winner learning rate ( $e_{b}$ )	0.001	0.2	0.0118	0.0436
Neighbors learning rate ( $e_{n}$ )	0.0001	$e_{b}$	0.0076	0.0109
Relevance smoothness ( $s$ )	0.01	0.1	0.0781	0.0453
Connection threshold ( $c$ )	0	0.5	0.0301	0.1108
SEMMAP
Activation threshold ( $a_{t}$ )	-	-	0.5539	0.5539
Learning rate ( $e$ )	-	-	0.0139	0.0139
Summation limit ( $s_{t}$ )	2	15	5	7

Table II: Results (Accuracy) of Constante et al. [21] compared with the results obtained with the proposed approach (PA).

Seq.1/Path	Seq.2/Path	SPMK	SPACT	Both	PA
Freiburg/1	Freiburg/2	0.7810	0.6204	0.8267	0.6976
Freiburg/1	Saarbrücken/1	0.5019	0.5765	0.6127	0.8148
Freiburg/1	Saarbrücken/2	0.5320	0.5620	0.6089	0.5128

Table III: Results of the experiment with all selected data. Standard deviations are given in parentheses.

Train/Test	CE	Accuracy	Clusters	Categories
Both/Freiburg	0.601(0.02)	0.678(0.02)	9.40(1.54)	11
Freiburg/Freiburg	0.582(0.02)	0.650(0.02)	8.03(1.90)	8
Saarbrucken/Freiburg	0.568(0.03)	0.660(0.04)	10.67(2.66)	9
Both/Saarbrucken	0.454(0.02)	0.540(0.03)	9.13(1.74)	11
Saarbrucken/Saarbrucken	0.435(0.03)	0.541(0.02)	10.46(2.23)	9
Freiburg/Saarbrucken	0.444(0.03)	0.522(0.02)	8.13(1.70)	8

Equations

(1) $s(\mathbf{p})=\arg\max_{j}\left[ac(D(\mathbf{p},\mathbf{c}_{j}))\right]$

(2) $ac(D(\mathbf{p},\mathbf{c}_{j}))=\frac{1}{1+D(\mathbf{p},\mathbf{c}_{j})}$

(3) $D(\mathbf{p},\mathbf{c}_{j})=\sqrt{\sum_{i=1}^{m}(p_{i}-c_{ji})^{2}}$

(4) $\mathbf{c}_{s}(n+1)=\mathbf{c}_{s}(n)+e(\mathbf{p}-\mathbf{c}_{s}(n))$

(5) $\phi_{si}=\begin{cases}s_{t}&\text{if }\phi_{si}+r_{i}>s_{t},\\ \phi_{si}+r_{i}&\text{otherwise.}\end{cases}$

(6) $o_{si}=\log_{1+s_{t}}(1+\phi_{si})$

Full text

Incremental Semantic Mapping with Unsupervised On-line Learning

Ygor C. N. Sousa and Hansenclever F. Bassani, Member, IEEE

Center of Informatics - CIn

Federal University of Pernambuco

Recife, PE, Brazil, 50.740-560

Email: {ycns, hfb}@cin.ufpe.br

I Introduction

The idea of semantic mapping is to provide robots with human-centered models of the environment [1] that can be used in communication and reasoning. These methods usually receive as input a stream of low-level sensory data, e.g., from LIDARs and cameras, and extend the traditional mapping models (metric or topological maps) with higher-level semantic concepts that make sense to humans in terms of communication in natural language [2].

Semantic mapping is applied in different tasks involving interaction between humans and robots. Duvallet [3] used a semantic map in a wheelchair, allowing it to be controlled by natural language commands such as “go past the kitchen down the hall and then take a right”. Walter et al. [2] extended this idea by incorporating natural language descriptions of places as another source of sensory data. In another study, Walter et al. [4] used a semantic map to allow the understanding of natural language commands given to a forklift.

Current semantic mapping methods usually populate the map with semantic properties organized in a set of predefined types, such as environment size (e.g., large, small), place category (e.g., kitchen, bathroom) and the objects present (e.g., table, chair, TV, sofa). These properties are typically estimated by supervised classifiers from images, low-level sensory data or other extracted semantic properties [2, 1, 5].

Regarding the topological aspects of environment mapping, most works in the literature apply an incremental approach, creating the map progressively as the robotic agent navigates. The semantic information recognized in the environment is then incorporated into the map. However, the place categorization of the mapped environments is usually done by methods with non-incremental supervised learning [1, 6, 7]. More recently, authors have tried to modify supervised learning methods to add incremental learning [8] in order to overcome this limitation.

For robots that are intended to have a long lifetime and that should keep learning as they interact with humans and navigate through different and changing environments, it is necessary to develop appropriate long-term learning methods that can incorporate knowledge incrementally without degrading performance or requiring retraining [9].

Therefore, although traditional non-incremental supervised learning approaches may perform well in the short term or for specific tasks, they may not be adequate for long-term learning, since they would require frequent human intervention to update their architecture, to specify new categories to be learned, and to retrain with new data.

Alternatively, unsupervised learning approaches could minimize the need for such interventions, especially if associated with incremental learning, which allows new knowledge to be incorporated more easily. Additionally, if on-line training is also used, new knowledge can be incorporated in real time as more information becomes available, hence improving the robot's reasoning and communication capacities over time.

In this context, this work presents a semantic mapping approach with unsupervised, on-line and incremental learning, which is based on Self-Organizing Maps (SOM) with time-varying structure [10]. The proposed approach comprises four modules: (i) a Metric SLAM Module that yields the position of the robot; (ii) an Object Recognition Module that recognizes objects in images; (iii) the Semantic Mapping Module (SEMMAP) that creates a topological map of the environment, enriched with the objects recognized around each topological node, the only semantic information to be used to determine the place categories; and (iv) a place categorization module, denominated OLARFDSSOM, that categorizes the place of each topological node, taking as input the semantic information stored on it.

The first two modules are not the focus of the present work, and any suitable SLAM and object recognition methods can be used. The last two modules learn incrementally and are trained in an unsupervised on-line fashion. Human supervision is required only to attach linguistic labels to the place categories that the categorization module has already discovered.

The proposed approach was tested with the real-world data provided by Pronobis and Caputo [11] and presented promising results. In all experiments, the model was able to create an adequate topological map and to cluster together similar places visited in the explored environments with few errors. The model was also able to learn from newly visited environments without degrading the information previously learned.

The remainder of this article is organized as follows: Section II presents a short review of the related work on semantic mapping. Section III presents the proposed approach, which is evaluated in the experiments presented in Section IV. Finally, the conclusions are presented in Section V.

II Related Work

Most semantic mapping methods found in the literature are typically focused on the automatic interpretation of perceptions [12], which includes the inference of the place categories present in the environment. The approach presented by Kostavelis and Gasteratos [6] is one example: the authors introduced a semantically annotated topological mapping model that uses a Support Vector Machine (SVM) to infer the place categories.

Another example is the model introduced by Pronobis and Jensfelt [1], which uses multiple sensors to recognize different semantic features of the environment. The recognition of objects and of environment appearance is done through computer vision techniques, while the sizes and shapes of the rooms are extracted using laser scanners. The information acquired is classified by SVM models, and a probabilistic chain graph model [13] is used to infer the categories of places. In addition, Bastianelli et al. [12] and Gemignani et al. [14] presented models that incrementally add objects pointed out by users to multi-layered semantic maps. They, however, do not perform automatic categorization of places.

Unlike the others, Sunderhauf et al. [8] presented a model that incrementally learns new place categories but still requires supervision in its learning process. In this work, a convolutional neural network is extended by a one-vs-all Random Forest classifier that learns new place categories in a supervised fashion.

Although one of the mentioned methods enables incremental learning at the place categorization step, all of them require some level of supervision and none conducts on-line training. In the literature of unsupervised machine learning, there are methods that are able to incrementally learn data categories. The methods derived from the Self-Organizing Maps with Time-Varying Structure are good candidates. These are a type of neural network in which nodes compete, cooperate and are created incrementally to cluster the input data. Bassani and Araujo [15] introduced a SOM model of this kind, called LARFDSSOM, that can also deal with high-dimensional input data. This family of models inspired the approach presented in this article, which is detailed in the next section.

III Proposed Approach

The proposed approach comprises four modules, which acquire and organize semantic information from the environment. They are:

• I - Metric SLAM: A method of Simultaneous Localization and Mapping of the environment that returns the current position ( $x$ , $y$ ) of the robot. This module was not implemented in this work since the position was already provided in the dataset considered.

• II - Object Recognition: A method that takes an image as input and recognizes a predefined set of objects that may be present in it, then outputs a vector, $\mathbf{r}$ , in which each component represents a level of certainty in the [0,1] interval, where zero means that the respective object was not recognized and one indicates that the object was recognized with a high level of certainty. In this work we use a pre-trained model called Inception-v3 [16], available from [17].

• III - SEMMAP: A SOM that builds a topological map of the environment, enriched with semantic information, which, in this work, consists solely of information about the objects recognized around each topological node.

• IV - OLARFDSSOM: An on-line version of the SOM proposed by Bassani and Araujo [15] that clusters the semantic information stored in the nodes of SEMMAP into categories that aim to represent the types of places visited (e.g., kitchen, corridor, office).

The architecture presented in Fig. 1 illustrates the information flow between the modules. The sensors that provide data for modules I and II can include LIDARs, Gyroscopes, Regular and Omnidirectional Cameras, etc., according to the techniques employed by modules I and II to determine the position and recognize the basic semantic information from the sensory data.

The following subsections describe in detail SEMMAP and OLARFDSSOM.

III-A SEMMAP

SEMMAP is a SOM based semantic mapping method that creates topological maps incorporating semantic information captured from the environment. The topological representation starts empty and is incrementally created in a graph form, as the agent moves around.

The map is represented by a graph $G=(\mathbf{V},\mathbf{E})$ , where, $\mathbf{V}=\{v_{j},j=1...k\}$ , is a vector of nodes that represents locations on the map, and $\mathbf{E}=\{e_{i},i=1...l\}$ , is a vector that represents transition relations between nodes. Each node $j$ on the map is associated with three vectors: $\mathbf{c}_{j}=\{c_{ji},i=1...m\}$ , represents the spatial position of the node (center), $\mathbf{o}_{j}=\{o_{ji},i=1...n\}$ ( $o_{ji}\in[0,1]$ ), represents the objects recognized around the position $\mathbf{c}_{j}$ , and $\boldsymbol{\phi}_{j}=\{\phi_{ji},i=1...n\}$ is a vector that accumulates the certainty level of the objects recognized, where, $\phi_{ji}\in[0,s_{t}]$ and $s_{t}$ is a parameter that defines an upper limit of accumulation. The vector $\boldsymbol{\phi}_{j}$ is used only to compute $\mathbf{o}_{j}$ .
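
As a rough illustration of this representation, the sketch below stores the three per-node vectors and the set of transitions in plain Python containers. The names `Node`, `SemanticMap`, `add_node` and `connect` are hypothetical and do not come from the paper, and storing edges as unordered pairs is an assumption about how navigability is represented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    c: list    # spatial position of the node (center), m components
    o: list    # object vector, n components, each in [0, 1]
    phi: list  # accumulated certainty per object, each in [0, s_t]


@dataclass
class SemanticMap:
    nodes: list = field(default_factory=list)  # V: locations on the map
    edges: set = field(default_factory=set)    # E: transitions between nodes

    def add_node(self, node):
        """Append a node and return its index j."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def connect(self, a, b):
        """Record a transition between nodes a and b (assumed undirected)."""
        self.edges.add(frozenset((a, b)))


# A map with two nodes, n = 3 recognizable objects, and one transition.
semantic_map = SemanticMap()
first = semantic_map.add_node(Node(c=[0.0, 0.0], o=[0.0] * 3, phi=[0.0] * 3))
second = semantic_map.add_node(Node(c=[1.0, 0.5], o=[0.0] * 3, phi=[0.0] * 3))
semantic_map.connect(first, second)
```

Keeping $\boldsymbol{\phi}_{j}$ next to $\mathbf{o}_{j}$ in the same record reflects that the former exists only to compute the latter.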

The operation of SEMMAP, like that of a regular SOM [18], comprises three steps: competition, adaptation, and cooperation. Below we describe how each of these steps is performed in SEMMAP; Alg. 1 summarizes them.

III-A1 Competition

In SEMMAP, during the competition step, the nodes on the map compete to cluster the input data, which in this case is the position received from Module I. The winner of a competition is the most active node according to a radial basis function, i.e., the nearest node to the input position. Whenever the winner node does not achieve a certain activation threshold, a new node is introduced in the map at the position of the input data.

The position input patterns, $\mathbf{p}$ , provided by Module I are presented to the map as the agent moves around the environment, where $\mathbf{p}=(x,y)$ , since it usually represents the position of the agent on the horizontal plane.

When a position input pattern is presented, a competition occurs to determine which node best represents $\mathbf{p}$ . The winner of the competition, $s(\mathbf{p})$ , is the node with the highest activation for the input pattern:

$s(\mathbf{p})=\arg\max_{j}\left[ac(D(\mathbf{p},\mathbf{c}_{j}))\right].$ (1)

The activation of a node, $ac(D(\mathbf{p},\mathbf{c}_{j}))$ , is computed as a function of the Euclidean distance between the input pattern and the node center:

$ac(D(\mathbf{p},\mathbf{c}_{j}))=\frac{1}{1+D(\mathbf{p},\mathbf{c}_{j})},$ (2)

where $D(\mathbf{p},\mathbf{c}_{j})$ is the usual Euclidean distance:

$D(\mathbf{p},\mathbf{c}_{j})=\sqrt{\sum_{i=1}^{m}(p_{i}-c_{ji})^{2}}.$ (3)

In a competition, if no node achieves the activation threshold $a_{t}$ , or if the map is empty, then a new node, $\eta$ , is inserted into the map, with $\mathbf{c}_{\eta}=\mathbf{p}$ , $\boldsymbol{\phi}_{\eta}=\mathbf{r}$ , and $\mathbf{o}_{\eta}$ initialized as per Eq. 6 (line 9 in Alg. 1). Otherwise, the winner node is updated as described next.
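
The competition step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Alg. 1: the function names are hypothetical, and returning `None` to signal "insert a new node" is a convention chosen here for brevity.

```python
import math


def distance(p, c):
    """Euclidean distance between input position p and node center c."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))


def activation(p, c):
    """Node activation 1 / (1 + D); it equals 1 when p coincides with c."""
    return 1.0 / (1.0 + distance(p, c))


def compete(p, centers, a_t):
    """Return the winner's index, or None when a new node must be inserted."""
    if not centers:
        return None  # empty map: the first input always creates a node
    winner = max(range(len(centers)), key=lambda j: activation(p, centers[j]))
    if activation(p, centers[winner]) < a_t:
        return None  # no node reaches the activation threshold
    return winner


centers = [(0.0, 0.0), (2.0, 0.0)]
a_t = 0.5539  # SEMMAP activation threshold from Table I
near = compete((0.3, 0.4), centers, a_t)  # D = 0.5 to node 0, so node 0 wins
far = compete((0.0, 5.0), centers, a_t)   # every activation is below a_t
```

Because the activation is $1/(1+D)$, the threshold translates directly into a distance: with $a_{t}=0.5539$ a new node is created once the input is farther than $1/a_{t}-1\approx 0.805$ from every existing center, in the units of the position data.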

III-A2 Adaptation

In the adaptation step, the winner node is adapted to bring its position closer to the position of the input data. The three vectors associated with the winner node $s$ , namely $\mathbf{c}_{s}$ , $\boldsymbol{\phi}_{s}$ and $\mathbf{o}_{s}$ , are updated (lines 14-16 in Alg. 1).

The vector $\mathbf{c}_{s}$ is updated taking into account a learning rate $e\in{]0,1[}$ as follows:

$\mathbf{c}_{s}(n+1)=\mathbf{c}_{s}(n)+e(\mathbf{p}-\mathbf{c}_{s}(n)).$ (4)

Although in the current version the semantic properties provided by Module II were not used during the competition, they are accumulated by the winner node. The input patterns representing the objects recognized by Module II, $\mathbf{r}$ , are presented to the map as the images are processed, where, $\mathbf{r}=\{r_{i},i=1...n\}$ is a vector containing the certainty level of recognition of each object, $r_{i}\in[0,1]$ , and $n$ is the number of different objects that Module II can recognize.

In order to estimate the new value of the object vector, $\mathbf{o}_{s}$ , first $\boldsymbol{\phi}_{s}$ is updated. This vector accumulates the evidence about the presence of the objects in the surroundings of the node $s$ . This strategy aims to mitigate the problem of object occlusion by collecting data from different viewpoints. To achieve that, the object certainty values, $\mathbf{r}$ , are accumulated in $\boldsymbol{\phi}_{s}$ through a summation limited by $s_{t}$ , as follows:

$\phi_{si}=\begin{cases}s_{t}&\text{if }\phi_{si}+r_{i}>s_{t},\\ \phi_{si}+r_{i}&\text{otherwise.}\end{cases}$ (5)

Then, we compute each component, $i$ , of the object vector, $\mathbf{o}_{s}$ , as a log function of the respective component in $\boldsymbol{\phi}_{s}$ :

$o_{si}=\log_{1+s_{t}}(1+\phi_{si}),$ (6)

where $s_{t}$ is the upper limit used in Eq. 5, applied here to ensure that each component $o_{si}$ lies in the $[0,1]$ interval.
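
The three updates of the adaptation step can be written compactly as below. This is a sketch under the definitions given in this subsection; the function names are hypothetical, and the saturating sum is expressed with `min`, which is equivalent to the two-case rule for the accumulator.

```python
import math


def update_center(c, p, e):
    """Move the winner's center a fraction e of the way toward the input p."""
    return [ci + e * (pi - ci) for ci, pi in zip(c, p)]


def accumulate(phi, r, s_t):
    """Add the certainties r to phi, saturating each component at s_t."""
    return [min(s_t, phi_i + r_i) for phi_i, r_i in zip(phi, r)]


def object_vector(phi, s_t):
    """Map each accumulator in [0, s_t] to [0, 1] with a base (1 + s_t) log."""
    return [math.log(1.0 + phi_i, 1.0 + s_t) for phi_i in phi]


s_t = 5.0  # summation limit of configuration A in Table I
c = update_center([0.0, 0.0], [1.0, 2.0], e=0.0139)
phi = accumulate([0.0, 4.5, 0.0], [0.0, 0.9, 0.2], s_t)  # [0.0, 5.0, 0.2]
o = object_vector(phi, s_t)
```

The base $1+s_{t}$ is what makes the bounds work out: an empty accumulator gives $\log_{1+s_{t}}(1)=0$ and a saturated one gives $\log_{1+s_{t}}(1+s_{t})=1$, while the logarithm lets the first few detections count more than later ones.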

III-A3 Neighborhood and Cooperation

In the cooperation step, the neighborhood of the winner node is updated. In SEMMAP the neighborhood is formed during the transitions of the agent between two nodes on the map, i.e., nodes that are consecutive winners are connected. The same happens when a new node is inserted into the map: the new node is connected to the previous winner if any (lines 10 and 18 in Alg. 1).

Unlike in usual SOMs, in SEMMAP the positions of the neighbors are not updated, and the connections are only used to represent the navigability between the nodes. We intend to explore neighborhood cooperation further in future versions.

Another important aspect of SEMMAP is that the semantic properties stored in the previously visited node are sent to the next module, OLARFDSSOM, whenever a transition occurs. More specifically, when the agent moves from a node $j_{a}$ to a node $j_{b}$ , the information stored in $\mathbf{o}_{j_{a}}$ is sent to OLARFDSSOM for training (lines 11 and 18 in Alg. 1). This moment was chosen because, at that point, node $j_{a}$ is expected to have accumulated enough semantic information about its surroundings to describe the category of the place.
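
The transition logic described above can be sketched as follows. The function name `cooperate`, the `outbox` list standing in for the link to OLARFDSSOM, and the use of unordered pairs for connections are all assumptions made for illustration.

```python
def cooperate(edges, outbox, object_vectors, previous, winner):
    """Connect consecutive winners and forward the departed node's objects.

    edges: set of unordered node pairs (connections assumed undirected)
    outbox: list collecting the patterns sent to the place categorizer
    object_vectors: dict mapping a node index j to its object vector o_j
    """
    if previous is not None and winner != previous:
        edges.add(frozenset((previous, winner)))
        outbox.append(list(object_vectors[previous]))
    return winner


object_vectors = {0: [0.9, 0.0], 1: [0.1, 0.7], 2: [0.0, 0.4]}
edges, outbox = set(), []
previous = None
for winner in [0, 0, 1, 1, 2, 1]:  # winners of six consecutive competitions
    previous = cooperate(edges, outbox, object_vectors, previous, winner)
```

In this run the agent makes three transitions (0 to 1, 1 to 2, and 2 back to 1), so three object vectors are forwarded for training, but only two connections are stored because the last transition reuses an existing one.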

III-B OLARFDSSOM

The acronym stands for On-line Local Adaptive Receptive Field Dimension Selective Self-Organizing Map. The proposition here is to introduce an on-line version of LARFDSSOM, a SOM with time-varying structure proposed by Bassani and Araujo [15]. LARFDSSOM itself is a subspace clustering method that can find clusters and identify their relevant dimensions simultaneously during the self-organization process, with unsupervised and incremental learning. Like the original version, OLARFDSSOM is a general-purpose method that could be applied to different problems, but it has so far only been tested in this setting.

In the present work, the input data to be clustered are the semantic properties (object certainty vectors) collected by SEMMAP, and the clusters formed are expected to represent the different place categories visited by the agent. We consider that LARFDSSOM is a suitable method for this task because it employs a locally weighted distance metric to adjust the relevances of the input dimensions. This is an important property when the input data has high dimensionality, as the object vectors provided by SEMMAP may have. Therefore, LARFDSSOM is able to identify which objects are relevant for determining each place category. As an example, the map can learn that the presence of a TV in a kitchen is not as relevant for recognizing this kind of place as the presence of a stove or a sink. These relevances are automatically adjusted for each cluster.

In OLARFDSSOM the steps of competition, adaptation, and cooperation are performed as in LARFDSSOM. We refer the reader to the original paper [15] for the details about these procedures. However, the original method was not intended to operate with on-line data input, since it operates in three phases: self-organization, convergence, and clustering. So, in the on-line version presented here, the model operates through two procedures that occur in parallel: the training and clustering procedures. The main difference between them is that the adaptation and cooperation steps occur only during the training procedure.

Both procedures are described below and summarized in pseudo-code in Alg. 2 (training procedure) and Alg. 3 (clustering procedure).

III-B1 Training Procedure

The training procedure runs after the initialization of the network parameters, whenever a new training pattern $x$ is presented at the input. As in LARFDSSOM, the first step of the training procedure is the competition, which determines the winner node, $s$ . If the activation of the winner node is below the threshold $a_{t}$ , a new node is inserted into the map at the position of the training pattern. The new node is initialized and connected to other nodes (lines 8-10 in Alg. 2). If the activation of the winner is greater than or equal to $a_{t}$ , the adaptation and cooperation steps are performed (lines 12-13 in Alg. 2).

In LARFDSSOM, each node $j$ in the map stores a variable, $wins_{j}$ , that counts the number of wins of this node with activations not lower than the threshold since the last reset. A reset occurs after $maxcomp$ competitions. At this moment, the nodes whose number of wins is below the limit $l_{p}\times maxcomp$ are removed from the map, where $l_{p}$ is a parameter representing the lowest percentage of wins allowed for a node in the map. To avoid the removal of recently created nodes, when a node is created its number of wins is set to $l_{p}\times nwins$ , where $nwins$ is the number of competitions that have occurred since the last reset (lines 15-18 in Alg. 2).

Finally, in LARFDSSOM, after the node removal, the number of wins of the remaining nodes is reset to zero. This reset is not done in OLARFDSSOM, to avoid the removal of nodes that represent categories of places that were not recently visited. Therefore, nodes that achieve a number of wins equal to or greater than $l_{p}\times maxcomp$ will never be removed from the map.
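
The win bookkeeping can be illustrated with the short sketch below. It covers only the removal rule and the initial win credit described above, not the full training procedure of Alg. 2; the function names and the node identifiers are hypothetical.

```python
def initial_wins(l_p, nwins):
    """Win credit given to a node created nwins competitions after a reset."""
    return l_p * nwins


def prune(wins, l_p, maxcomp):
    """Keep the nodes whose win count reached l_p * maxcomp.

    Unlike in LARFDSSOM, the counts of the surviving nodes are returned
    unchanged (no reset to zero), following the OLARFDSSOM description.
    """
    limit = l_p * maxcomp
    return {node: w for node, w in wins.items() if w >= limit}


l_p, maxcomp = 0.1914, 34  # configuration A in Table I
wins = {
    "frequent": 12,
    "rare": 3,
    # created 20 competitions after the last reset, then won 3 times
    "recent": initial_wins(l_p, nwins=20) + 3,
}
kept = prune(wins, l_p, maxcomp)
```

With configuration A the limit is $0.1914\times 34\approx 6.51$ wins. The recently created node starts with a credit of about 3.83 wins, so three further wins are enough for it to survive, while the rarely winning node is removed. Since surviving counts are never reset, a node that has passed the limit once stays above it.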

III-B2 Clustering Procedure

The clustering procedure consists of assigning an input pattern to a cluster. In LARFDSSOM it was done only after the self-organization and convergence phases were finished. In OLARFDSSOM, it may occur at any moment, in parallel with the training procedure, and the result will reflect the current state of the map.

In OLARFDSSOM each cluster is associated with a unique id that can be retrieved whenever it is necessary to determine which kind of place the agent is in. This id can be further associated with a linguistic label of a place, such as “kitchen” or “office”, for communication in natural language. Thus, OLARFDSSOM can output the current place category (cluster id) for any SEMMAP node, taking the node's semantic properties (object certainty vector) as the input pattern, as needed. Alg. 3 details the procedure.

IV Experiments

In this section, we describe the experiments carried out with the proposed approach. They aimed to evaluate the quality of the obtained maps, both in terms of precision of the topological mapping and of semantic acquisition. In the following subsections, we first describe the dataset considered (Section IV-A), then the evaluation measures chosen (Section IV-B) and how the parameter adjustment was made (Section IV-C). The obtained results are presented and discussed in Sections IV-D and IV-E.

IV-A Dataset

In this work, we use the COLD dataset (COsy Localization Database) provided by Pronobis and Caputo [11]. The dataset consists of three separate sub-datasets acquired in three different laboratories, each located in a different European city (Freiburg, Ljubljana, and Saarbrucken). Each sub-dataset comprises sequences of images captured with regular and omnidirectional cameras, along with position data obtained via odometry and laser range scans, as the robotic platform moves along different paths through the facilities. This dataset was chosen mainly because it contains both images and position data.

In total, there are 76 data sequences of 9 different paths in the dataset, 26 of 3 paths in Freiburg, 18 of 2 paths in Ljubljana and 32 of 4 paths in Saarbrucken. However, due to imprecisions found in the position data of several data sequences, only 18 sequences of 6 paths (3 sequences of each path) were used, 6 sequences of 2 paths from Freiburg, and 12 sequences of 4 paths from Saarbrucken. In the 6 used paths, there are 11 different categories of places and each chosen path contains a subset of these categories.

In this work, we use only the images captured with the regular camera and the respective positions of their acquisition on the environment, $\mathbf{p}=(x,y)$ .

For each image of the considered dataset, we ran the pre-trained object recognition method, Inception-v3 [16], available in the TensorFlow library [17]. We defined a set of 18 objects to be recognized: window shade, bookcase, electric fan, couch, washbasin, soap dispenser, toilet seat, photocopier, monitor, desktop computer, desk, table, chair, banister, microwave oven, stove, dishwasher, and toaster. Each object is recognized with a certainty degree in the [0,1] interval, where zero denotes that the object was not recognized and one the maximum level of certainty. Therefore, each image was transformed into an 18-dimensional vector of certainty levels, $\mathbf{r}$ , which was paired with the respective position of acquisition, $\mathbf{p}$ .

IV-B Evaluation Measures

In this work, we considered two evaluation measures: Accuracy, which is widely used in the literature for evaluating place categorization, and Clustering Error (CE) [19]. We consider CE a better measure for comparing clustering methods that do not necessarily produce the same number of clusters, since it penalizes results with more clusters than necessary, while Accuracy tends to grow with the purity of the clusters, regardless of the number of clusters found.
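
The sensitivity of Accuracy to the number of clusters can be seen in a small example. The sketch below assumes that each cluster is labeled with its majority ground-truth category before accuracy is computed; this is one common convention and is an assumption here, since the exact computation is not spelled out in this section. The function name and the toy data are hypothetical, and CE itself is not implemented.

```python
from collections import Counter, defaultdict


def majority_accuracy(cluster_ids, labels):
    """Accuracy when every cluster is labeled with its majority category."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(labels)


labels = ["kitchen", "kitchen", "office", "office", "corridor", "corridor"]
coarse = [0, 0, 0, 1, 1, 1]  # two clusters, each mixing two categories
fine = [0, 1, 2, 3, 4, 5]    # one cluster per sample, trivially pure

coarse_accuracy = majority_accuracy(coarse, labels)
fine_accuracy = majority_accuracy(fine, labels)
```

Under this convention the over-segmented clustering scores a perfect accuracy even though it uses six clusters for three categories, which is the behavior that a measure penalizing superfluous clusters is meant to expose.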

IV-C Parameter Adjustment

To find adequate values for the parameters of SEMMAP and OLARFDSSOM, we ran a parameter sampling technique known as Latin Hypercube Sampling (LHS) [20] and recorded the best results achieved by the approach. The exceptions were the parameters $a_{t}$ and $e$ of the SEMMAP module, which were set manually beforehand because they directly affect the construction of the topological map. In LHS, the parameters are sampled within previously established ranges: the range of each parameter is divided into subintervals of equal probability, and a single value is chosen randomly from each subinterval. The ranges used for each parameter are presented in Table I. Almost all the experiments described below used the same final parameter configuration, listed in Table I as configuration A. The exception was the comparison experiment described in Section IV-E, which used configuration B (also in Table I) to better fit the conditions of the comparison.
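
A minimal version of this sampling scheme is sketched below using only the Python standard library. It assumes a uniform distribution over each range, so that equal-width subintervals are equally probable; the function name, the seed and the number of samples are hypothetical, and only three of the ranges in Table I are shown.

```python
import random


def latin_hypercube(ranges, n_samples, seed=0):
    """Draw n_samples configurations, one value per subinterval and parameter.

    ranges: dict mapping a parameter name to its (min, max) pair.
    Each range is split into n_samples equal-width subintervals (equally
    probable under the uniform distribution assumed here).
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / n_samples
        values = [low + (k + rng.random()) * width for k in range(n_samples)]
        rng.shuffle(values)  # decouple the subintervals of different parameters
        columns[name] = values
    return [{name: columns[name][k] for name in ranges} for k in range(n_samples)]


ranges = {"a_t": (0.8, 0.999), "l_p": (0.01, 0.2), "beta": (0.001, 0.1)}
samples = latin_hypercube(ranges, n_samples=10)
```

Each of the ten configurations would then be evaluated and the best one retained. Integer parameters such as $maxcomp$, and $e_{n}$, whose upper bound depends on the sampled $e_{b}$, need extra handling that this sketch leaves out.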

After analyzing the LHS results, we identified three OLARFDSSOM parameters that affect the performance of the approach most significantly and should be adjusted more carefully: $a_{t}$ , $maxcomp$ , and $l_{p}$ .

IV-D Evaluation of the Topology

The topology of the maps produced by the SEMMAP was evaluated considering two features: the position of the nodes and the connections between them. The data sequences were presented to the approach and the maps produced by the SEMMAP module were evaluated.

First, the position of each of the 695 nodes created for the 18 maps obtained (coming from the 18 data sequences used) was visually inspected by plotting diagrams in which the node positions are displayed over the positions of the input data. Fig. 2 presents a typical example of the results obtained. Such diagrams allowed us to conclude that the nodes were adequately placed in all paths considered.

In order to evaluate the connections between nodes, we evaluated each of the 710 connections formed, verifying whether they represent viable paths in the environment. Out of the 710 connections evaluated, only one was incorrectly inserted, which represents an accuracy of 709/710 ≈ 0.9986. We attribute the misplaced connection to a transitory error in the estimated coordinates provided by the dataset.

IV-E Evaluation of the Semantic Map

In the experiments described below, as the semantic map was built by SEMMAP with the input data from Modules I and II, OLARFDSSOM was trained to categorize the places visited, receiving as input the vector of objects stored in the nodes of SEMMAP on each transition. First, we present a comparison of the proposed approach with an image categorization method proposed by Constante et al. [21]. Afterwards, we present an experiment with all the selected paths from the COLD dataset and then an evaluation of the categorization performance over time.

IV-E1 Comparison

A current difficulty in the literature of semantic mapping is the lack of comparability between the results of the different proposed approaches. In order to establish how challenging was the dataset at hand, we considered the results presented by Constante et al. [21] as a reference. This work introduced a place categorization method that uses knowledge from previously labeled places for the categorization of new environments through an unsupervised transfer learning task. The categorization process is done off-line and frame by frame, so only local information contained in each image is analyzed at a time and no semantic map of the environments is created. The method considers two types of image descriptors known as SPMK [22] and SPACT [21], each of them is tested separately and then combined.

To evaluate the capabilities of their model, the authors used pairs of data sequences from distinct paths of the COLD dataset. The first data sequence was always fully labeled beforehand and presented to the method, which used this information to categorize the second data sequence. The proposed approach was tested on the same data sequences. However, it was trained with the unlabeled data of both data sequences and tested only on the second data sequence.

The results obtained are presented in Tab. II in terms of accuracy. As one can notice, the results of the proposed approach are similar to those presented in [21], which we consider a satisfactory result, since the proposed approach does not use labeled data and works in an on-line and incremental fashion as it creates the semantic map and recognizes the place categories. It was not possible to carry out a statistical test due to the different nature of the two methods.
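Because the proposed approach is never given labels, scoring its clusters against the ground truth requires some rule for relating clusters to place categories. The sketch below shows one common choice, majority-vote mapping, purely as an illustration; it is an assumption and not necessarily the accuracy definition used in this paper. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(cluster_ids, true_labels):
    """Accuracy after assigning each cluster its most frequent true label.

    Assumption: this mapping rule is illustrative only; the paper's own
    accuracy definition may differ.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    # Count how often each ground-truth label occurs inside each cluster.
    labels_per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        labels_per_cluster[cluster][label] += 1

    # Samples that agree with their cluster's majority label count as correct.
    correct = sum(
        counts.most_common(1)[0][1] for counts in labels_per_cluster.values()
    )
    return correct / len(true_labels)


# Toy example with made-up node assignments (not data from the paper).
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_labels = ["corridor", "corridor", "office", "office", "office", "kitchen"]
toy_accuracy = majority_vote_accuracy(toy_clusters, toy_labels)
```

In the toy example, five of the six nodes agree with the majority label of their cluster, so `toy_accuracy` is 5/6.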

IV-E2 Experiment with All Selected Data

In this experiment, the proposed approach was trained with all previously selected data sequences from both sub-datasets (Freiburg and Saarbrucken), and its categorization performance was then evaluated against the ground truth on the data sequences from each sub-dataset separately, giving the conditions Both/Freiburg and Both/Saarbrucken. Conditions are written as [training sub-datasets]/[test sub-datasets]; e.g., in Both/Freiburg the model was trained with both sub-datasets and tested on Freiburg. Additionally, the model was evaluated in the conditions Freiburg/Freiburg, Saarbrucken/Freiburg, Saarbrucken/Saarbrucken and Freiburg/Saarbrucken. This learning procedure was repeated 30 times with a random selection of data sequences, and averages and standard deviations were then calculated for both evaluation metrics.
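The protocol above can be sketched as a small harness that parses the condition notation and aggregates a metric over repeated runs. This is a hedged outline only: `run_once` is a hypothetical placeholder for training and evaluating the model once, the seed handling is an assumption, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import random
import statistics

SUB_DATASETS = ("Freiburg", "Saarbrucken")


def parse_condition(condition):
    """Split '[training sub-datasets]/[test sub-dataset]' into its parts."""
    train, test = condition.split("/")
    train_sets = SUB_DATASETS if train == "Both" else (train,)
    return train_sets, test


def evaluate_condition(condition, run_once, repetitions=30, seed=0):
    """Repeat one condition and return (mean, standard deviation) of a metric.

    run_once(train_sets, test_set, rng) is a hypothetical callable that
    trains on a random selection of sequences and returns a single score.
    """
    rng = random.Random(seed)
    train_sets, test_set = parse_condition(condition)
    scores = [run_once(train_sets, test_set, rng) for _ in range(repetitions)]
    # Sample standard deviation (assumption; the estimator is not stated).
    return statistics.mean(scores), statistics.stdev(scores)


def constant_run(train_sets, test_set, rng):
    """Stand-in for a real training and evaluation run; always returns 0.5."""
    return 0.5


mean_score, std_score = evaluate_condition("Both/Freiburg", constant_run)
```

Passing the random generator into `run_once` keeps the selection of data sequences reproducible for a given seed while still varying across the 30 repetitions.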

The results obtained are presented in Tab. III. As one can notice, in conditions Both/Freiburg and Both/Saarbrucken the results are similar or slightly superior to all the others, which was confirmed with a statistical test. This suggests that the model does not degrade in performance as more data is fed into it. Furthermore, the results in conditions Saarbrucken/Freiburg and Freiburg/Saarbrucken are quite similar to those in Freiburg/Freiburg and Saarbrucken/Saarbrucken, respectively, which provides further evidence of the generalization capability of the proposed approach. Tab. III also displays the number of categories found by OLARFDSSOM in comparison with the ground truth. The method found a similar number of clusters in all cases.

An illustration of the typical semantic map obtained with the proposed approach is presented in Fig. 3 for a data sequence of a Freiburg path. The color of each node represents one place category found by the model and the dashed squares indicate the expected categories according to the ground truth.

IV-E3 Over Time Evaluation

In order to evaluate the categorization performance of the proposed approach over time, it was sequentially trained with a random selection of all the data sequences previously selected, with the categorization of each data sequence being evaluated against the ground truth at two moments: right after training on that data sequence and after training on all data sequences. The aim is to verify whether the model's performance on the sequences trained early degrades after it is trained with other sequences.

The results obtained are shown graphically in Fig. 4. As can be seen, the values of both evaluation measures after training on all sequences were mostly (83.4% of cases in CE and 77.8% in accuracy) similar or superior to those obtained right after training on each sequence. This gives an indication of the behavior of the model when it is fed with real-time data, displaying its capacity to learn incrementally without apparently degrading its categorization performance on previously learned sequences.
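The comparison described above amounts to counting, per metric, the sequences whose final score is no worse than the score measured right after training on them. The sketch below is illustrative: the function name, the `tolerance` parameter (one possible reading of "similar") and the toy values are assumptions, not taken from the paper. CE is treated as lower-is-better, on the assumption that it is an error measure.

```python
def fraction_not_degraded(after_each, after_all, higher_is_better=True,
                          tolerance=0.0):
    """Fraction of sequences whose final score is similar or superior.

    after_each[i]: score of sequence i right after it was trained.
    after_all[i]:  score of sequence i after all sequences were trained.
    tolerance:     largest drop still counted as 'similar' (assumption).
    """
    if len(after_each) != len(after_all):
        raise ValueError("both score lists must have the same length")
    if not after_each:
        raise ValueError("at least one sequence is required")

    kept = 0
    for first, final in zip(after_each, after_all):
        # Positive change means improvement for either metric direction.
        change = final - first if higher_is_better else first - final
        if change >= -tolerance:
            kept += 1
    return kept / len(after_each)


# Made-up scores for four sequences, for illustration only.
toy_accuracy_kept = fraction_not_degraded(
    [0.80, 0.70, 0.90, 0.60], [0.82, 0.70, 0.85, 0.65]
)
toy_ce_kept = fraction_not_degraded(
    [0.2, 0.3], [0.1, 0.4], higher_is_better=False
)
```

With these toy values, three of the four accuracy scores and one of the two CE scores are similar or superior after training on all sequences.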

V Conclusion

This paper presented an on-line incremental semantic mapping approach, with unsupervised learning, based on Self-Organizing Maps with Time-Varying Structure. The approach builds topological maps enriched with objects recognized around each node as semantic information. This information is used in real time by an unsupervised learning method to incrementally, and in an on-line fashion, form clusters representing categories of places visited by the agent. The nodes of the topological maps can be categorized at any time with the current state of the unsupervised place category learning method. To the best of our knowledge, this is the first semantic mapping approach with the mentioned characteristics, thus enabling an agent to build semantic maps and learn place categories in real time, as it moves around the environment.

Moreover, the categorization results obtained were promising, as the place categories found were mostly coherent with the ground truth, grouping together most nodes of the same category, without the model's capacity degrading over time. The model still has difficulty properly grouping nodes located in zones of transition. This was expected, since only objects were used as semantic information and objects can be seen by the camera even before the agent enters a new room.

There are several ways in which the proposed approach could be extended. For future work, we first intend to incorporate other kinds of semantic information, such as place geometry, size, and linguistic data.

Acknowledgment

The authors would like to thank the Brazilian National Council for Scientific and Technological Development (CNPq) for supporting this work.

Bibliography

[1] A. Pronobis and P. Jensfelt, “Large-scale semantic mapping and reasoning with heterogeneous modalities,” in International Conference on Robotics and Automation. IEEE, May 2012, pp. 3515–3522.
[2] M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller, “A framework for learning semantic maps from grounded natural language descriptions,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1167–1190, 2014.
[3] F. Duvallet, “Natural language direction following for robots in unstructured unknown environments,” Ph.D. dissertation, Carnegie Mellon University, 2015.
[4] M. R. Walter, M. Antone, E. Chuangsuwanich, A. Correa, R. Davis, L. Fletcher, E. Frazzoli, Y. Friedman, J. Glass, J. P. How, J. Jeon, S. Karaman, B. Luders, N. Roy, S. Tellex, and S. Teller, “A situationally aware voice-commandable robotic forklift working alongside people in unstructured outdoor environments,” Journal of Field Robotics, vol. 32, no. 4, pp. 590–628, 2015.
[5] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile robotics tasks: A survey,” Robotics and Autonomous Systems, vol. 66, pp. 86–103, 2015.
[6] ——, “Learning spatially semantic representations for cognitive robot navigation,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1460–1475, 2013.
[7] ——, “Semantic maps from multiple visual cues,” Expert Systems with Applications, vol. 68, pp. 45–57, 2017.
[8] N. Sunderhauf, F. Dayoub, S. Mcmahon, B. Talbot, R. Schultz, P. Corke, G. Wyeth, B. Upcroft, and M. Milford, “Place categorization and semantic mapping on a mobile robot,” in International Conference on Robotics and Automation (ICRA), May 2016, pp. 5729–5736.