ViTa-SLAM: A Bio-inspired Visuo-Tactile SLAM for Navigation while Interacting with Aliased Environments
Oliver Struckmeier, Kshitij Tiwari, Mohammed Salman, Martin J. Pearson, and Ville Kyrki

TL;DR
ViTa-SLAM is a novel bio-inspired SLAM framework that integrates vision and tactile sensing, enabling robots to navigate and recognize environments even in ambiguous or aliased scenes by fusing multi-sensory data.
Contribution
It introduces a new multi-sensory SLAM approach combining vision and tactile data, improving robustness in ambiguous environments over previous visual-only or tactile-only methods.
Findings
Successfully handles ambiguous scenes with sensor fusion.
Enhances environment recognition through combined visuo-tactile data.
Provides a natural interaction mechanism during navigation.
Abstract
RatSLAM is a rat hippocampus-inspired visual Simultaneous Localization and Mapping (SLAM) framework capable of generating semi-metric topological representations of indoor and outdoor environments. Whisker-RatSLAM is a 6D extension of the RatSLAM and primarily focuses on object recognition by generating point clouds of objects based on whisking information. This paper introduces a novel extension to both former works that is referred to as ViTa-SLAM that harnesses both vision and tactile information for performing SLAM. This not only allows the robot to perform natural interaction with the environment whilst navigating, as is normally seen in nature, but also provides a mechanism to fuse non-unique tactile and unique visual data. Compared to the former works, our approach can handle ambiguous scenes in which one sensor alone is not capable of identifying false-positive loop-closures.
Oliver Struckmeier and Kshitij Tiwari contributed equally.
This research has received funding from the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 785907 (Human Brain Project SGA2).
Oliver Struckmeier, Kshitij Tiwari, and Ville Kyrki are with the Department of Electrical Engineering and Automation, Aalto University, Espoo 02150, Finland ({firstname.lastname}@aalto.fi).
Mohammed Salman is with the Bristol Robotics Laboratory, University of Bristol and University of the West of England, Bristol, UK.
Martin J. Pearson is with the Bristol Robotics Laboratory, Bristol BS16 1QY, UK.
Abstract
RatSLAM is a rat hippocampus-inspired visual Simultaneous Localization and Mapping (SLAM) framework capable of generating semi-metric topological representations of indoor and outdoor environments. Whisker-RatSLAM is a 6D extension of the RatSLAM and primarily focuses on object recognition by generating point clouds of objects based on tactile information from an array of biomimetic whiskers. This paper introduces a novel extension to both former works that is referred to as ViTa-SLAM that harnesses both vision and tactile information for performing SLAM. This not only allows the robot to perform natural interactions with the environment whilst navigating, as is normally seen in nature, but also provides a mechanism to fuse non-unique tactile and unique visual data. Compared to the former works, our approach can handle ambiguous scenes in which one sensor alone is not capable of identifying false-positive loop-closures.
I Introduction
Robots are often equipped with off-the-shelf sensors like cameras which are used for navigation. However, vision is sensitive to extremes in lighting conditions such as shadows or unpredictable changes in intensity, as shown in Fig. 1(a). Whilst other on-board sensors like laser range finders can be used in such situations, they too are impaired by reflective and absorbing surfaces. Similarly, sensory systems as they occur in nature are subject to impairments, e.g., a rat moving through a maze in ill-lit conditions as illustrated in Fig. 1(b). However, through the process of evolution, nature has equipped animals to gracefully accommodate such scenarios. Given their coarse vision and the challenges of their natural environment, rodents are known to rely on tactile feedback derived from whiskers, in addition to vision, to determine their own location. In the example depicted in Fig. 1(b), a rat navigates a maze in which visual or tactile information alone is ambiguous in certain locations, but combining tactile and visual information can help to discern similar locations. Conventional robots lack such a robust capability to interact with their environment through contact. Thus, biomimetic robots are gaining traction [1], which has now made it possible to harness visual and tactile sensory modalities for informed decision making. However, it still remains unclear how best to process and combine information from disparate sensory modalities to aid in spatial navigation.
Previous works on visuo-tactile sensor fusion like [2, 3] usually combine sensory modalities of varying sensing ranges. The key requirement of these methods was a redundant field-of-view. Other works in this domain like [4, 5, 6] have mainly focused on creating dense haptic object/scene maps. Whilst these methods allow for environmental interactions, they are primarily designed for tactile object exploration and grasping. Although tactile sensing is increasingly being used in these domains, it has yet to be applied to Simultaneous Localization and Mapping (SLAM).
In the context of SLAM, previous works have demonstrated the strengths of a bio-inspired SLAM system and shown its application using single-sensory modalities such as either vision [7], sonar [8] or WiFi [9]. However, such methods rely on the uniqueness of the data and are thus susceptible to false-positive place recognition. This problem was previously addressed by fusing information from an array of active sensors each providing rich information [10]. Despite the robustness to illumination changes, this method is not capable of fusing non-unique sensory information.
To address the challenges of place recognition in aliased environments using multiple sensors we present our novel method of identifying and preventing false-positive place recognition by combining long-range (unique) vision, with short-range (non-unique) tactile information. Additionally, the proposed method does not rely on sensory redundancy. Our preliminary results presented in [11] showed that the method presented herewith is capable of preventing false-positive place recognition from a vision-only SLAM system. Subsequently, a robust sensor fusion algorithm has been developed to integrate information from unique and non-unique sensory modalities such as cameras and whiskers, respectively. Additionally, performance metrics are presented herewith to compare and evaluate model performance against vision or tactile only sensing.
II Bio-inspired SLAM
This work draws inspiration from two well-known bio-inspired SLAM frameworks: RatSLAM, a rat hippocampal model based visual SLAM architecture [7]; and Whisker-RatSLAM, an extension of RatSLAM aimed primarily at tactile object exploration [12]. This work relies on modified variants of these models referred to as Visual-SLAM and Tactile-SLAM. In this section these models are summarized and the differences from the works in [7, 12] are shown. Lastly, the proposed ViTa-SLAM model is introduced.
For each of the models, an overall system architecture is provided using the following convention: nodes represented by right isosceles triangles represent raw sensory data; nodes represented by ellipses represent pre-processing of sensory data before they are converted to input features, which are represented by rounded boxes. The outputs from the models are represented by regular boxes. The pre-processing and feature generation stages are highlighted in light blue and light red blocks, respectively.
II-A Visual-SLAM
When investigating the way rodents navigate from a bio-inspired perspective, RatSLAM as introduced in [7, 13, 14, 10, 15], has been proven to be a capable visual SLAM method. RatSLAM is loosely based on the neural processes underlying navigation in the rodent (primarily rat) brain, more specifically the hippocampus. Fig. 2 shows an overview of the visual-SLAM implementation used in this work.
During the preprocessing phase, the input of a camera (visual data) is downsampled to reduce computational cost and to simulate the coarse vision of rats. In this process the incoming visual data is cropped to remove areas that do not provide unique features, such as the ground. The cropped image is subsampled and converted to greyscale as shown in Fig. 3.
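The crop, subsample, and greyscale steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `preprocess_frame`, the crop range, the subsampling step, and the choice of BT.601 luma weights are all assumptions made for the example.

```python
import numpy as np

def preprocess_frame(rgb, crop_rows=(0, None), step=8):
    """Crop, subsample and convert an RGB frame to greyscale.

    rgb       : (H, W, 3) uint8 image
    crop_rows : (top, bottom) rows to keep (hypothetical parameter)
    step      : keep every `step`-th pixel in both axes (hypothetical value)
    """
    top, bottom = crop_rows
    cropped = rgb[top:bottom]          # drop rows with few unique features, e.g. the ground
    sub = cropped[::step, ::step]      # coarse subsampling
    # Weighted sum of the colour channels (ITU-R BT.601 luma weights, an assumption)
    grey = sub.astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    return grey / 255.0                # intensities scaled to [0, 1]

# Usage: keep the top 320 rows of a 480x640 frame and subsample by 8
frame = np.full((480, 640, 3), 255, dtype=np.uint8)
template = preprocess_frame(frame, crop_rows=(0, 320), step=8)
```

The resulting low-resolution array is the kind of input a template-based place recognition stage can compare cheaply.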
The preprocessed sensory information is now parsed through major components of the RatSLAM architecture:
- Pose Cells
- Local View Cells
- Experience Map
The pose cells [7] encode the robot’s current best pose estimate. Pose cells are represented by a Continuous Attractor Network (CAN) [13, Ch. 4], the posecell network, to resemble the grid cells as introduced in [16]. The grid cells are neurons found in many mammals and are shown to be used in navigation. In the 3D posecell network, the robot’s pose estimate ($x, y$ position and heading angle $\theta$) is encoded as an energy packet that is moved through energy injection based on odometry and place recognition.
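To make the idea of an energy packet moving through a wrap-around pose cell grid concrete, the following toy sketch stores activity in a 3D array and shifts it by whole cells. It is a deliberate simplification and not the RatSLAM attractor dynamics: there is no excitation, inhibition, or fractional shifting, and the names `inject`, `shift_packet`, and `best_pose` as well as the grid size are hypothetical.

```python
import numpy as np

def inject(cells, idx, energy=1.0):
    """Add energy at one (x, y, theta) cell, e.g. on place recognition."""
    cells[idx] += energy
    return cells

def shift_packet(cells, dx, dy, dth):
    """Move all activity by whole cells with wrap-around (toy path integration)."""
    return np.roll(cells, shift=(dx, dy, dth), axis=(0, 1, 2))

def best_pose(cells):
    """Index of the most active cell, used here as the pose estimate."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(cells), cells.shape))

# Usage: a 10 x 10 x 36 grid (hypothetical size) with one packet
cells = np.zeros((10, 10, 36))
cells = inject(cells, (2, 3, 0))
cells = shift_packet(cells, dx=1, dy=0, dth=-1)  # odometry: one cell in x, one heading bin back
pose = best_pose(cells)
```

The wrap-around of `np.roll` mirrors the periodic boundaries of the pose cell grid: a heading index shifted below zero re-enters at the top of the range.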
The local view (LV) cells are an expandable array of units used to store the distinct visual scenes as a visual template in the environment using a low resolution subsampled frame of pixels. The visual template generated from the current view is compared to all existing view templates by shifting them relative to each other. If the current view is novel, a local view cell is linked with the centroid of the dominant activity package in the pose cells at the time when a scene is observed. When a scene is seen again, the local view cell injects activity into the pose cells.
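The comparison of a current view against stored templates under relative shifts can be sketched as below. This is an illustrative assumption of how such matching might look, using the mean absolute difference over the overlapping columns; the function names, the shift range, and the threshold are hypothetical and not taken from the paper.

```python
import numpy as np

def template_distance(a, b, max_shift=2):
    """Smallest mean absolute difference between a and b over horizontal shifts."""
    best = np.inf
    width = a.shape[1]
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            diff = a[:, s:] - b[:, :width - s]   # a shifted right relative to b
        else:
            diff = a[:, :s] - b[:, -s:]          # a shifted left relative to b
        best = min(best, float(np.mean(np.abs(diff))))
    return best

def match_or_add(templates, view, threshold=0.1, max_shift=2):
    """Return (index, is_new): index of the matching template, or of a new one."""
    if templates:
        dists = [template_distance(view, t, max_shift) for t in templates]
        i = int(np.argmin(dists))
        if dists[i] <= threshold:
            return i, False                      # familiar scene: reuse local view cell i
    templates.append(view.copy())                # novel scene: create a new local view cell
    return len(templates) - 1, True

# Usage
rng = np.random.default_rng(0)
view = rng.random((8, 12))
templates = []
first = match_or_add(templates, view)                         # novel view
again = match_or_add(templates, np.roll(view, 1, axis=1))     # same view, shifted by one column
```

Allowing a small shift makes the match tolerant to slight heading differences between two visits to the same place.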
The experience map is a semi-metric topological representation of the robot’s path in the environment generated by combining information from the pose cells and local view cells into experiences. Each experience is related to the pose cell and local view cell networks via the following 4-tuple: $e = \langle x, y, \theta, V \rangle$, where $x, y, \theta$ represent the location of the cell in the PC network while $V$ corresponds to the view associated with the LV cell that relates to the queried experience [17].
Initially the robot relies on odometry which is subject to an accumulating error. When loop closure events happen, meaning a scene has been seen already, the pose estimate based on the odometry is compared to the pose of the experience and graph relaxation is applied [17].
The following differences to RatSLAM have been introduced in the visual-SLAM implementation: first, we use odometry from the robot instead of visual odometry as was originally done to determine the translational and rotational speeds of the robot. Second, the method of template matching and generation has been modified to account for multiple sensory modalities. Third, the posecell network (PC) is now capable of handling a wider range of robot motion such as moving sideways.
II-B Tactile-SLAM
Whisker-RatSLAM is a 6D tactile SLAM algorithm inspired by RatSLAM. Instead of taking input from a camera, it uses a tactile whisker-array mounted on a robot as its only sensor [18, 19]. The whisker-array consists of whiskers, each capable of measuring the point of whisker contact in 3D space and the 2D deflection force at their base [20]. Whisker-RatSLAM [12] has been demonstrated for mapping objects and localizing the whisker array relative to the surface of an object. Similar to RatSLAM, Whisker-RatSLAM generates a semi-metric topological map, the object exploration map, which contains complex 6 DOF experience nodes.
In [12], the authors proposed combining these object exploration maps with a simple 3 DOF experience map generated using RatSLAM with the whisker input, resulting in a topological terrain exploration map with two different types of experience nodes. Fig. 4 shows an overview of the Whisker-RatSLAM algorithm. The tactile data acquired by whisking comprises the 3D contact point cloud of the object (3D Cts.) and the deflection data (Defl.). The point cloud is used to generate the Point Feature Histogram (PFH), while the deflection data is used to generate the Slope Distribution Array (SDA). Both PFH and SDA are then fused to obtain a Feature Cell (FC). Similar to the RatSLAM experience map, the pose grid cells and FC that were active at a specific 6D pose of the whisker-array are associated with each other and combined into experience nodes. The experience in this case is defined as the 7-tuple $e = \langle x, y, z, \alpha, \beta, \gamma, F \rangle$, where $(x, y, z, \alpha, \beta, \gamma)$ represents the 6D pose, including Euler angles for orientation, and $F$ represents the features associated with that experience. The experience nodes form the object exploration map (Obj. Expl. Map). To adapt the activation of the pose cells in accordance with the robot motion, the odometry information is also used in the pose grid.
The tactile-SLAM implementation is based on Whisker-RatSLAM, but instead of a 6D posecell network this work uses the same 3D posecell network as the visual-SLAM implementation to allow compatibility and to reduce the computational cost of planar navigation. Furthermore, the tactile-SLAM implementation used in this work does not use feature cells, but instead combines the SDA and PFH data into tactile templates that are used in a similar way to visual templates. Fig. 4 shows an overview of the tactile-SLAM algorithm. The tactile data acquired by whisking comprises the 3D contact point cloud of the object (3D Cts.) and the deflection data (Defl.). The point cloud is used to generate the Point Feature Histogram (PFH), while the deflection data is used to generate the Slope Distribution Array (SDA). Both PFH and SDA are then fused to obtain a tactile template (T). Similar to the RatSLAM experience map, the pose grid cells and T that were active at a specific pose of the whisker-array are associated with each other and combined into experience nodes. In contrast to the 7-tuple used in Whisker-RatSLAM, the experiences are defined as a 4-tuple $e = \langle x, y, \theta, T \rangle$, where $T$ represents the tactile template associated with that experience. The experience nodes also form a semi-metric experience map similar to the visual-SLAM method. As in the visual-SLAM method, the robot’s odometry information is also used to move the pose grid.
To generate tactile information using whiskers, one challenge is how to control the whisker-array in order to improve the quality of the sensory information. Previous research on rats [21] has identified a number of whisking strategies that rodents use to potentially improve the sensory information they obtain. One of these strategies is called Rapid Cessation of Protraction (RCP) and refers to the rapid reduction in motor drive applied to the whisker when it makes contact with an object during the protraction phase of exploratory whisking [22]. This effectively reduces the magnitude of bend of the whisker upon contact which in artificial arrays, such as shown in [23], improves the quality of sensory information by constraining the range of sensory response to a region best suited for signal processing. Furthermore, damage to the whiskers from contact is significantly reduced.
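The effect of RCP on whisker bend can be illustrated with a toy simulation of a single protraction. This is a sketch under strong assumptions, not the controller used on the robot: the whisker angle simply integrates the motor drive, contact is detected when the angle reaches a fixed `contact_angle`, and the parameter `rcp_gain` (the factor by which the drive is cut on contact) is hypothetical.

```python
def simulate_protraction(contact_angle, drive=1.0, steps=20, dt=0.1, rcp_gain=None):
    """Return the peak bend (angle travelled past the contact point).

    rcp_gain=None disables RCP; otherwise the drive is multiplied by
    rcp_gain for as long as the whisker is in contact.
    """
    angle = 0.0
    peak_bend = 0.0
    for _ in range(steps):
        in_contact = angle >= contact_angle
        # RCP: rapidly reduce the motor drive once contact is made
        u = drive * rcp_gain if (in_contact and rcp_gain is not None) else drive
        angle += u * dt
        peak_bend = max(peak_bend, angle - contact_angle)
    return peak_bend

# Usage: same object position, with and without RCP
bend_plain = simulate_protraction(contact_angle=1.0)
bend_rcp = simulate_protraction(contact_angle=1.0, rcp_gain=0.1)
```

In this toy model the whisker with RCP travels far less past the contact point, which corresponds to the reduced bend magnitude described above.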
As opposed to the full 6D pose estimation in Whisker-RatSLAM, the modified tactile-SLAM estimates only the 3D pose ($x, y, \theta$) to maintain compatibility with the visual-SLAM model. This also avoids the computational overhead of maintaining a 6D posecell network, which is not required for navigation on a mobile robot platform.
II-C ViTa-SLAM
In this section, we present the details of our novel visuo-tactile SLAM algorithm which we refer to as ViTa-SLAM.
The overall system architecture for ViTa-SLAM is shown in Fig. 5. Three kinds of raw sensory data are now utilized simultaneously: tactile, visual, and odometry. Tactile and visual data are pre-processed and converted into tactile and visual templates ($T$ and $V$, respectively). A 3D pose cell network is maintained. The experience in this approach is now defined as a 5-tuple $e = \langle x, y, \theta, V, T \rangle$, where $V$ is a visual template and $T$ is a tactile template at the 3D pose given by $(x, y)$ and $\theta$. The experience map in this case will be referred to as the vita map. In contrast with the conventional experience map, the vita map’s nodes contain visual and tactile data. A node is termed a sparse node if the tactile data is empty and a dense node otherwise. As an example, when the whiskers do not make contact, they provide no information while the camera can still acquire novel scene information. When the whiskers are whisking a wall or landmark, both the camera and the whiskers yield features that allow the creation of informative dense nodes, which greatly help visuo-tactile SLAM. The properties of a vita-map node (dense or sparse) are stored in the vita map but not further used.
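A vita-map node as described above can be represented by a small data structure holding the pose, the visual template, and an optional tactile template. The sketch below is illustrative only: the class name `VitaExperience`, the field names, and the convention that a missing or empty tactile array marks a sparse node are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class VitaExperience:
    """One vita-map node: 3D pose plus visual and tactile templates."""
    x: float
    y: float
    theta: float
    visual: np.ndarray                    # visual template V
    tactile: Optional[np.ndarray] = None  # tactile template T; None when no whisker contact

    @property
    def is_dense(self) -> bool:
        """Dense if tactile data is present, sparse otherwise."""
        return self.tactile is not None and self.tactile.size > 0

# Usage: a node created without whisker contact and one created while whisking a wall
sparse_node = VitaExperience(0.0, 0.0, 0.0, visual=np.zeros((8, 12)))
dense_node = VitaExperience(1.0, 0.5, 1.57, visual=np.zeros((8, 12)), tactile=np.ones(16))
```

Keeping the tactile field optional lets the same node type cover both cases, so the map does not need two separate node classes.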
References
- [1] T. J. Prescott, M. J. Pearson, B. Mitchinson, J. C. W. Sullivan, and A. G. Pipe, “Whisking with robots,” IEEE Robotics & Automation Magazine, vol. 16, no. 3, pp. 42–50, 2009.
- [2] N. Alt and E. Steinbach, “Navigation and manipulation planning using a visuo-haptic sensor on a mobile platform,” IEEE Transactions on Instrumentation and Measurement, vol. 63, no. 11, pp. 2570–2582, 2014.
- [3] N. Alt, Q. Rao, and E. Steinbach, “Haptic exploration for navigation tasks using a visuo-haptic sensor,” in Interactive Perception Workshop, ICRA, 2013.
- [4] T. Bhattacharjee, A. A. Shenoi, D. Park, J. M. Rehg, and C. C. Kemp, “Combining tactile sensing and vision for rapid haptic mapping,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 1200–1207, IEEE, 2015.
- [5] L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta, “The curious robot: Learning visual representations via physical interactions,” in European Conference on Computer Vision, pp. 3–18, Springer, 2016.
- [6] R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine, “More than a feeling: Learning to grasp and regrasp using vision and touch,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300–3307, 2018.
- [7] D. Ball, S. Heath, J. Wiles, G. Wyeth, P. Corke, and M. Milford, “OpenRatSLAM: an open source brain-based SLAM system,” Autonomous Robots, vol. 34, no. 3, pp. 149–176, 2013.
- [8] J. Steckel and H. Peremans, “BatSLAM: Simultaneous localization and mapping using biomimetic sonar,” PLoS ONE, vol. 8, no. 1, 2013.
