ROS Rescue: Fault Tolerance System for Robot Operating System
Pushyami Kaveti, Hanumant Singh

TL;DR
This paper introduces a lightweight fault-tolerance mechanism for ROS master failure that lets a running system recover without a full restart. It was tested on land, aerial, and underwater robots.
Contribution
A low-overhead ROS master recovery system with logging and monitoring, offered as a lighter alternative to existing solutions based on primary-backup replication and external checkpointing libraries.
Findings
Successfully implemented on land, aerial, and underwater robots
Enables recovery without aborting nodes
Code available on GitHub
Abstract
In this chapter we discuss the problem of master failure in ROS 1.0 and its impact on robotic deployments in the real world. We address this issue in this tutorial chapter, where we outline, design, and demonstrate a fault-tolerant mechanism for ROS master failure. Unlike previous solutions, which rely on primary-backup replication and external checkpointing libraries and are therefore process heavy, our mechanism adds lightweight functionality to the ROS master that enables it to recover from failure. We present a modified version of the ROS master equipped with a logging mechanism that records the meta information and network state of ROS nodes, as well as a recovery mechanism that returns to the previous state without having to abort or restart all the nodes. We also implement an additional master monitor node responsible for failure detection on the master by polling it for its…
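The polling-based failure detection described in the abstract can be illustrated with a minimal sketch. This is not the authors' monitor node: the abstract is cut off before it says what the monitor polls for, so the sketch assumes a probe built on `getPid`, a method of the standard ROS Master XML-RPC API. The names `make_master_probe`, `MasterMonitor`, and `max_misses` are hypothetical, and so is the rule of declaring failure only after several consecutive missed polls.

```python
import xmlrpc.client
from typing import Callable


def make_master_probe(master_uri: str,
                      caller_id: str = "/master_monitor") -> Callable[[], bool]:
    """Build a probe that asks the ROS master for its PID over XML-RPC.

    Assumption: getPid is used as the liveness check; the paper's monitor
    may poll for something else.
    """
    def probe() -> bool:
        try:
            proxy = xmlrpc.client.ServerProxy(master_uri)
            code, _status, _pid = proxy.getPid(caller_id)
            return code == 1  # status code 1 means success in the Master API
        except (OSError, xmlrpc.client.Error):
            # Connection refused, timeout, bad URI, or an XML-RPC fault.
            return False
    return probe


class MasterMonitor:
    """Declare the master failed after `max_misses` consecutive failed polls."""

    def __init__(self, probe: Callable[[], bool], max_misses: int = 3):
        if max_misses < 1:
            raise ValueError("max_misses must be at least 1")
        self.probe = probe
        self.max_misses = max_misses
        self.misses = 0

    def poll_once(self) -> bool:
        """Poll the master once; return True if it is now considered failed."""
        if self.probe():
            self.misses = 0  # any successful poll resets the counter
            return False
        self.misses += 1
        return self.misses >= self.max_misses
```

In a deployment, `poll_once` would be called on a timer with a probe such as `make_master_probe("http://localhost:11311")` (11311 is the default ROS master port), and a `True` result would trigger whatever restart and recovery procedure is in place. Requiring several consecutive misses is a design choice that trades detection latency for fewer false alarms on a briefly busy master.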
Taxonomy
Topics: Distributed systems and fault tolerance · Robotics and Automated Systems · Modular Robots and Swarm Intelligence
