Argus: Smartphone-enabled Human Cooperation via Multi-Agent Reinforcement Learning for Disaster Situational Awareness
Vidyasagar Sadhu, Gabriel Salles-Loustau, Dario Pompili, Saman Zonouz, Vincent Sritapan

TL;DR
Argus leverages multi-agent reinforcement learning with smartphones and drones to create real-time 3D maps of disaster scenes, enhancing situational awareness for rescue operations.
Contribution
This work introduces a novel MARL framework integrating human and robotic agents for real-time disaster scene mapping using smartphones and drones.
Findings
Effective in tracking dynamic environments in simulations.
Successful real-world deployment demonstrates practical utility.
Improves situational awareness for rescue efforts.
Abstract
Argus exploits a Multi-Agent Reinforcement Learning (MARL) framework to create a 3D mapping of the disaster scene using agents present around the incident zone to facilitate the rescue operations. The agents can be human bystanders at the disaster scene as well as drones or robots that can assist the humans. The agents capture images of the scene using their smartphones (or on-board cameras in the case of drones) as directed by the MARL algorithm. These images are used to build a 3D map of the disaster scene in real time. Via both simulations and real experiments, an evaluation of the framework in terms of effectiveness in tracking random dynamicity of the environment is presented.
Argus: Smartphone-enabled Human Cooperation via Multi-Agent Reinforcement Learning for Disaster Situational Awareness
Vidyasagar Sadhu, Gabriel Salles-Loustau, Dario Pompili, Saman Zonouz, Vincent Sritapan∗
Department of Electrical and Computer Engineering, ∗Cyber Security Division
Rutgers University, ∗Department of Homeland Security Science & Technology Directorate
{vidyasagar.sadhu, gs643, pompili, saman.zonouz}@rutgers.edu, ∗[email protected]
Keywords:
Disaster response, situational awareness, multi-agent reinforcement learning, distributed Q-learning, data collection, crowdsensing.
I Introduction
Disaster management involves a very large number of heterogeneous agents in an uncontrolled and potentially hostile environment. These agents include victims and the rescue personnel that eventually intervene. Situational-awareness techniques that enable better incident management strategies will go a long way toward reducing disaster damage. Several works focus on modeling incident response operations, but very few can be used in real disasters because of the challenges that arise in practical situations. Several challenges in designing an interactive incident response system, such as the coordination of the intervention forces, are presented in [1].
**Motivations:** Traditionally, incident reporting relies on expensive information collection from authorities. These reports are most often produced offline and fail to give the incident response team actionable information ahead of time. However, in these situations it is common to find people present at those locations (bystanders/onlookers) recording the scene with their smartphones. Very few of the existing incident response frameworks take advantage of this information. We argue that it is valuable because it is first hand, real time, and local, and could provide fine-grained information about the scene. Also, in some cases, this information can be complemented by drones taking pictures or videos of the disaster scene.
**Our Approach:** Contrary to existing solutions such as those based on CCTV cameras, our solution, Argus (from Argus Panoptes, the “all-seeing” hundred-eyed giant of Greek mythology), utilizes the information coming directly from the bystanders and also provides a base framework in which drones/robots and humans can work collaboratively to provide fine-grained information to the rescue authorities. Argus is implemented as a mobile application that can be installed on the users’ smartphones to capture and process data such as images of the incident scene. We then consolidate this data into an easy-to-visualize form, such as a 3D map of the disaster scene, in real time. The advantage of this approach over looking at individual images is that the images are difficult to process one by one, as a huge number of them may be streaming in. That same abundance, however, is a boon for 3D reconstruction, which needs many images from different views. Also, generating a 3D model from a sequence of images is much cheaper than using other techniques such as 3D scanners. We believe that integrating other well-known technologies such as audio source separation into our framework can add value (e.g., to prioritize some regions over others). In this work, we make use of a Multi-Agent Reinforcement Learning (MARL) framework for adaptive and effective data collection. The reasons for considering MARL are as follows: (1) to enable full coverage of the scene, without which most of the images would correspond to a few conspicuous regions only (e.g., fire regions); (2) to enable fine-grained coverage (high resolution in the 3D maps) of regions of interest specific to the user (e.g., behind the building).
Related Work: The applicability and usefulness of MARL for building evacuation simulations of emergency/disaster situations, and for modeling a large building in which a heterogeneous multi-agent population attempts to evacuate in the presence of a non-stationary fire, is investigated in [2]. In [3], the design and development of a mobile system is presented that reconstructs a 3D model of a building’s interior structure in real time and fuses the visualization with the image of a thermal camera. RescueNet [4] proposes a networking paradigm based on MARL to enable reliable and high data-rate wireless multimedia communication among public safety agencies using mission policies that enable graceful degradation in the Quality of Service (QoS) of the incumbent networks in times of emergency. Various solutions for real-time monitoring of incident zones, including interactive systems, are proposed in [5, 1]. The design and implementation of a framework to simulate incident response, such as building evacuation, in which victims are modeled as agents with a specific set of properties such as their visibility, is presented in [6]. A multiple-agent Markov Decision Process (MDP) model is leveraged to synchronize the actions of the response team after an incident in [7]. None of these frameworks use MARL for data collection from the bystanders to create 3D maps as we do. Regarding MARL, there are two types of frameworks, cooperative and competitive [8]. In the former the agents work cooperatively towards a common goal (with the same rewards), while in the latter the agents compete against each other with opposite rewards (e.g., the sum of the rewards of all agents is zero). Our scenario corresponds to the former. A simple MARL framework for cooperative agents that does not assume any coordination among the agents is proposed in [9]. Each agent maintains a local Q-table and a policy, both of which are updated only if there is an improvement in the Q-values.
However, for each local (state, action) pair, it requires that the other agents’ actions happen infinitely often, which may not be practical in time-constrained scenarios. The idea of Q-value sharing is proposed in [10, 11]. We make use of a similar concept in our algorithm, except that all agents share the same Q-table rather than maintaining individual tables with knowledge sharing as they propose.
Our contributions are summarized below:
- A model for real-time incident response data collection using Multi-Agent Reinforcement Learning (MARL) (available as a mobile application) using humans as agents for data collection.
- Evaluation of the MARL framework through simulations to find optimal design parameters and to study its behavior for random fire occurrences.
**Paper Outline:** In Sect. II, we present the Argus architecture and the system design including the MARL and Q-learning frameworks. In Sect. III, we study our framework by varying the design parameters and also evaluate its performance for random fire occurrences. Finally, we conclude the paper and mention our future directions in Sect. IV.
II Proposed Solution
**Argus use case:** Fig. 1(a) presents a use-case scenario for Argus, which shows a building (represented by a cube) on fire. Other use cases could be large forest fires, flooding, a terrorist attack, etc. Argus is installed as a mobile phone application and directs bystanders present at the incident zone on what actions to take (e.g., what pictures to take using their smartphones) to gather relevant information pertaining to the incident zone. This is achieved as follows. Argus decomposes the incident zone into multiple subzones with sufficient overlap. For example, in the case of a building on fire, each subzone could correspond to a face or a corner of the building, with the corners providing overlap between the faces. Each subzone is further divided into a rectangular grid of states, as shown in Fig. 1(a). Each grid is cast as a Markov Decision Process (MDP), which is solved using MARL with the agents being the bystanders around the incident zone (A, B, C) and any drones (D) that can assist humans in the data-gathering process. The Argus mobile application, which interfaces with the MARL framework running in the cloud, directs the agents (humans or drones) on what information to collect from the incident zone by giving directions such as “Left,” “Up,” etc. The drones could take part in the process just as humans do, could complement them by capturing information that is not accessible to humans (for example, when a forest lines one face of the building), or could be assigned special tasks to capture information in regions where it is incomplete and immediately needed. As mentioned earlier, we have chosen to create a 3D map of the disaster scene, in which case the information gathered from the agents will be images or videos of the disaster scene.
We aim at providing near real-time 3D mapping of the disaster scene so that the rescue personnel are aware of the entire situation, know where to focus their attention immediately, and can plan appropriately. We used a GUI program called VisualSFM [12] for the 3D reconstruction, which consists of two parts: a well-known Structure From Motion (SFM) algorithm [13] for sparse 3D reconstruction and CMVS/PMVS [13] for dense 3D reconstruction. Fig. 2(b) and Fig. 2(c), respectively, show the sparse and the dense 3D reconstruction of a building (Fig. 2(a)) using the SFM and PMVS/CMVS packages found in VisualSFM. We used 12 images of the building taken from different views using an LG D800 smartphone with resolution. We would like to mention that in real incidents it is a realistic assumption to have hundreds of images, making the reconstruction more detailed. Also, there will be a higher resolution in areas of interest such as fire regions, damage regions, etc., as they will be the reward states in MARL, as we explain below.
**Argus architecture:** Our framework consists of the following parts, as shown in Fig. 1(b): 1. A client-side data-collection framework that is part of the mobile application and collects audio and video from the user’s smartphone. 2. A server-side MARL framework that interfaces with the mobile application (MARL Interface) and directs the agents on what parts of the disaster scene to capture by providing directions such as “Left,” “Up,” etc. 3. A server-side data-processing framework that uses the images and/or audio data sent by the clients to generate 3D maps and to perform other processing tasks (such as audio source separation). The processed data, in the form of 3D maps, is then sent to rescue office personnel who are in a position to guide rescue personnel using this information. The MARL framework directs the agents (humans/drones) to capture photos in a proper way, i.e., capturing a few photos to give the context and many others focused primarily on the regions of interest. Our framework achieves this by assigning higher rewards to the data corresponding to regions of interest. We make use of distributed Q-learning to solve our MARL problem. During the exploration phase of Q-learning, the agents explore the entire disaster scene to find reward states and create a coarse 3D map of the entire scene. Then, during the exploitation phase, more time is spent in reward states to create finer 3D maps of those regions. We chose Q-learning as it is model free and can adapt to changing environments (such as fire propagation).
**System Design:** We now describe our MARL framework consisting of a distributed Q-learning approach for exploration and of a periodic exploration-exploitation mechanism to capture the dynamicity of the environment. To reiterate, our MARL framework is used for data collection only. As the images are obtained, 3D reconstruction is done using those images in real time, as mentioned in Sect. II.
MARL Framework: Although our framework is quite general and can be applied to any scenario, in order to be concrete we will continue with our example of a building on fire as in Fig. 1(a). We formulate the problem as an MDP as follows. We divide the building into 8 surface views consisting of 4 faces and 4 corners, assuming a regular cuboid structure. Each face and each corner is modeled as a rectangular grid, and the bystanders present near that view are assigned to that particular MDP, making that process a Multi-Agent System (MAS). In the case of corners, the grid is created by unfolding them into plane surfaces and including some portion of each of the constituent faces. Each MDP acts independently of the others. Images from all MDPs are fused to form the 3D map of the building. The states are the cells (x,y) of the grid. For now, let us assume that the humans (agents) participating in our framework are static and can only pan or tilt their phones. An agent is said to be in a state (x,y) if it positions its smartphone camera focusing/centering on the cell (x,y). The actions for each state are those of a standard grid-world MDP: {Left, Right, Up, Down}. Here, Left and Right are realized by panning (rotating about the vertical axis) the phone by a specific angle $\theta$ in those directions. Two consecutive Right actions would mean panning by $2\theta$ to the right, and so on. Similarly, Up and Down are realized by tilting (rotating about the horizontal axis) the phone by a specific angle $\phi$. We could make use of the gyroscopes in the phone to provide feedback to the users on when to stop panning (or tilting) the phone once they have covered an angle of $\theta$ (or $\phi$) in the specified direction. Whenever an agent takes an action to move to a different state, it captures the image of the current state (grid cell in the MDP) before panning/tilting the phone. This image determines the reward the agent receives for that state.
In other words, our reward depends on the state alone rather than on the state-action-state sequence. Since the regions with more fire are of primary interest, a higher reward is assigned when a larger percentage of fire pixels is detected in the image. To prevent an agent from inflating this percentage simply by zooming in on the fire, we normalize it by the zoom factor (Eq. 1). We note that the reward can be changed as per the user’s requirements. If uniform coverage is the final goal, incomplete regions will be the reward states. Using this approach, the algorithm can steer the agents to capture images of interest to the user, where our reward is,
$$R(s) = \frac{\text{percentage of fire pixels in the image of state } s}{\text{zoom factor}} \qquad (1)$$
Finally, the size of the grid and the pan and tilt angles ($\theta$ and $\phi$) will be determined by these factors: the dimension of the incident zone, the average distance of the agents from the incident zone, and the average capture angle of the agents’ devices.
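The state reward described above can be illustrated with a minimal Python sketch. It assumes one reading of "normalize by the zoom factor", namely a plain division, and it uses a crude RGB threshold as a stand-in fire detector. The function names, the threshold values, and the image representation (rows of RGB tuples) are all hypothetical and are not taken from the Argus implementation.

```python
def is_fire_pixel(rgb):
    """Hypothetical heuristic: strong red channel with red > green > blue."""
    r, g, b = rgb
    return r > 180 and r > g > b


def fire_pixel_fraction(pixels):
    """Fraction of pixels flagged as fire in an image given as rows of RGB tuples."""
    flat = [p for row in pixels for p in row]
    if not flat:
        return 0.0
    return sum(1 for p in flat if is_fire_pixel(p)) / len(flat)


def state_reward(pixels, zoom_factor=1.0):
    """Fire-pixel fraction divided by the zoom factor (assumed form of the normalization)."""
    if zoom_factor < 1.0:
        raise ValueError("zoom_factor is assumed to be >= 1 (1 means no zoom)")
    return fire_pixel_fraction(pixels) / zoom_factor


# A 2x2 toy image: two flame-colored pixels, two blue ones.
image = [[(255, 120, 0), (0, 0, 255)],
         [(250, 100, 10), (0, 0, 255)]]
reward_no_zoom = state_reward(image, zoom_factor=1.0)
reward_zoomed = state_reward(image, zoom_factor=2.0)
```

With this normalization, an image that is half fire at 2x zoom earns the same reward as an unzoomed image that is one quarter fire, which is the anti-cheating effect the text describes.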
*1) Distributed Q-Learning:* In Q-learning the agent learns from each experience as it interacts with the environment and updates its Q-values based on the reward that it gets from this experience. Let us denote the set of all states as $S$, the set of all actions as $A$, and the reward as $r$. A single experience with the environment at time $t$ can be represented by the tuple $(s_t, a_t, r_t, s_{t+1})$. This means the agent was in state $s_t$, took action $a_t$, received reward $r_t$, and landed in state $s_{t+1}$. The agent learns (updates its Q-values) through a series of such experiences. Let us denote the Q-value of a state-action pair $(s, a)$ at time $t$ as $Q_t(s, a)$. With the experience from time $t$, the Q-value can be updated as follows,

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a' \in A} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right] \qquad (2)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. As can be seen from (2), Q-values are updated based on the immediate reward, $r_t$, and on the optimal expected return, $\max_{a' \in A} Q_t(s_{t+1}, a')$. Note that the learning rate is set to a large value in the beginning and is slowly reduced to ensure convergence.
Given that we have a MARL problem, we make use of distributed Q-learning (slightly modified from [10, 11]) where the agents share the same Q-values using a shared database (with synchronous read/write) to expedite the exploration process. We would like to clarify that it is still a multi-agent system, except that the Q-values are the same for all agents. Each agent still acts independently. As all agents explore in parallel and make use of the knowledge/experience gained by all other agents (through the Q-values), the exploration process is greatly sped up.
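The shared-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single in-memory dictionary stands in for the shared database, agents pick actions uniformly at random instead of following an exploration policy, and the grid convention (y grows downward, moves off the grid leave the agent in place) is a choice made here. The names `step`, `q_update`, and `run_shared` are hypothetical.

```python
import random
from collections import defaultdict

# Grid-world actions as (dx, dy); y is assumed to grow downward.
ACTIONS = {"Left": (-1, 0), "Right": (1, 0), "Up": (0, -1), "Down": (0, 1)}


def step(state, action, size=4):
    """Move on a size x size grid; a move that would leave the grid keeps the agent in place."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    return (nx, ny) if 0 <= nx < size and 0 <= ny < size else state


def q_update(q, s, a, r, s_next, alpha, gamma):
    """Standard one-step Q-learning update applied to the shared table q."""
    best_next = max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def run_shared(rewards, n_agents=3, steps=200, alpha=0.5, gamma=0.9, seed=0):
    """Let n_agents explore in parallel while reading and writing one Q-table."""
    rng = random.Random(seed)
    q = defaultdict(float)           # one table shared by every agent
    states = [(0, 0)] * n_agents     # all agents start in the same cell
    for _ in range(steps):
        for i in range(n_agents):
            s = states[i]
            a = rng.choice(list(ACTIONS))   # uniform random choice, for brevity only
            r = rewards.get(s, 0.0)         # reward of the state imaged before moving
            s_next = step(s, a)
            q_update(q, s, a, r, s_next, alpha, gamma)
            states[i] = s_next
    return q
```

Because every agent writes into the same table, an experience gathered by one agent is immediately visible to the others, which is the mechanism the text credits for the faster exploration.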
*2) Periodic Exploration:* There are two well-known exploration strategies: the $\epsilon$-greedy approach and Boltzmann exploration. Even though both strategies reduce the randomness in action selection over time, the former has a disadvantage: when $\epsilon$-greedy needs to select a random action (with probability $\epsilon$), it gives equal importance to all actions with no regard to their Q-values. Let us say there are 3 actions with decreasing Q-values other than the optimal action, which has an even higher Q-value. In the case of $\epsilon$-greedy, all 3 actions are given equal importance, whereas it would be desirable to assign each a probability that increases with its Q-value. Boltzmann exploration achieves this: the probability of selecting an action is directly related to its Q-value (proportional to $e^{Q(s,a)/T}$),

$$P(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a' \in A} e^{Q(s,a')/T}} \qquad (3)$$

Here, the temperature parameter, $T$, decides the amount of exploration or exploitation; if $T$ is large, all actions have almost equal probability, resulting in pure exploration, whereas when $T$ tends to 0 the optimal action has the highest probability, resulting in pure exploitation.
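The contrast between the two strategies can be made concrete with a short Python sketch. It assumes the common variant of epsilon-greedy in which the random draw is uniform over all actions (including the greedy one), and it subtracts the maximum Q-value before exponentiating purely for numerical stability, which does not change the Boltzmann probabilities. The function names are hypothetical.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values with temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    m = max(q_values)  # shifting by the max avoids overflow in exp()
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def epsilon_greedy_probs(q_values, epsilon):
    """Selection probabilities under epsilon-greedy (uniform random draw with prob. epsilon)."""
    n = len(q_values)
    best = max(range(n), key=lambda i: q_values[i])
    return [epsilon / n + ((1.0 - epsilon) if i == best else 0.0) for i in range(n)]


def sample_action(actions, probs, rng=random):
    """Draw one action according to the given probabilities."""
    return rng.choices(actions, weights=probs, k=1)[0]


q_vals = [1.0, 2.0, 3.0, 4.0]              # three sub-optimal actions and one optimal
boltz = boltzmann_probs(q_vals, 1.0)       # probabilities increase with the Q-value
greedy = epsilon_greedy_probs(q_vals, 0.2)  # the three sub-optimal actions get equal mass
```

Under epsilon-greedy the three sub-optimal actions are indistinguishable, while under Boltzmann selection the action with Q-value 3 is chosen more often than the one with Q-value 1.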
In order to adapt to the dynamicity of the environment, we enable periodic exploration. In each period, there is a short exploration phase followed by a long exploitation phase. Accordingly, we vary the exploration parameter (the Boltzmann temperature $T$) and the learning rate ($\alpha$) in a periodic fashion (Fig. 3). We note there are two extremes to this mechanism. At one extreme, there is a need to account for random occurrences of fire in scenarios where the environment changes randomly, e.g., during a terrorist attack; in these cases, we reset the Q-values and the exploration parameter (Boltzmann temperature) across periods to enable new learning. At the other extreme, we have situations such as slow fire propagation, where there is no significant change in the environment across periods; in this case, it is preferable to carry the Q-values forward to the next period so as to leverage the learning from the previous period. The exploration parameter, $T$, can also be reduced quickly in the next period to enable less exploration and more exploitation. In both cases, the length of the period will depend on the rate of dynamicity of the scene.
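A periodic schedule of this kind can be sketched as follows. The sketch assumes an exponential decay inside each period (the evaluation section describes the decay as radioactive) and a hard restart at every period boundary. The function names, the `floor` argument, and the numeric values are illustrative assumptions, not the authors' settings.

```python
import math


def periodic_value(t, period, initial, tau, floor=0.0):
    """Parameter value at step t: restarts at `initial` every `period` steps,
    then decays as initial * exp(-phase / tau), never dropping below `floor`."""
    phase = t % period
    return max(floor, initial * math.exp(-phase / tau))


def begin_period(q_table, environment_changed):
    """Reset the Q-table after an abrupt change; otherwise carry it forward."""
    return {} if environment_changed else q_table


# Hypothetical settings: 180-step periods, temperature starting at 10.
temps = [periodic_value(t, period=180, initial=10.0, tau=20.0) for t in range(360)]
```

Here `temps[0]` and `temps[180]` are both 10.0, so each period begins with a burst of exploration that fades into exploitation. The same function can drive the learning rate with its own `initial` and `tau`.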
**Smartphone Application:** Fig. 4 shows some screenshots of our application. We let the users share sensor information such as camera, microphone, GPS, accelerometer, and gyroscope data, as this helps the 3D reconstruction (Fig. 4(a)). Fig. 4(b) shows the camera interface asking users to point towards the incident zone. After an image is captured, Fig. 4(c) shows the action to tilt UP (indicated by the UP arrow) before the next picture is taken. The app has an option to specify the upload server. In a real deployment, this will be the IP address of the central server in Fig. 1.
III Performance Evaluation
We evaluated our MARL framework based on distributed Q-learning in terms of key metrics such as percentage of time spent by agents on capturing the fire states (reward-states) vs. non-fire states as the parameters of the framework including the number of agents, decay rate of Boltzmann temperature, and learning rate are varied; also, we tested how well the algorithm adapts to random incidents. We coded the MARL framework in Python using distributed Q-learning on a 4x4 MDP grid shown in Fig. 5(a) with as the start state.
We remind the reader that the fire region will be divided into multiple MDPs, where each MDP corresponds to a single face of the building, a corner of the building, or some other view of the incident scene. A rectangular grid will be overlaid on each of the views, and the ensuing MDP will be solved on it. Fig. 5(a) shows an example of this scenario (the picture is sourced from the Internet). A 4x4 grid is overlaid on the incident zone, which means we have 16 states/cells in the MDP. Each agent’s action in a state will include capturing a picture of the state (centering the smartphone on the state) along with the physical action of panning/tilting. Since we do not have separate pictures of each state of the incident scene in Fig. 5(a), we approximated the image of each state by the grid piece obtained by dividing the picture in Fig. 5(a) along the white grid lines. Unlike in this simplified scenario, in reality we would have multiple pictures of each state captured by different agents at different times, and the reward would be calculated as the percentage of fire pixels in those images. Using these approximated images (grid pieces), we calculated the percentage of fire pixels in them and then normalized it to obtain the rewards for these states. We can observe that only states S2, S3, S6, S7 have non-zero rewards. We ran the distributed Q-learning with a discount factor of 0.9 until convergence of the values of the states. Note that all the agents share the same Q-values in our distributed Q-learning approach. The final Q-values and state values (which are the maximum Q-values in each state) are shown in Fig. 5(b) and Fig. 5(c), respectively. We can notice from Fig. 5(c) that fire states have higher values than the others and that the rewards propagate to surrounding states. Also, the optimal policy, indicated by arrowheads, leads the agent to the fire states, where it stays most of the time.
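The relation between Q-values, state values, and the arrow policy mentioned above can be written out as a small Python helper. It is a sketch that assumes the Q-table is a dictionary keyed by (state, action) pairs, with missing entries treated as 0. The function name and the toy numbers are made up for illustration and are not values from Fig. 5.

```python
def state_values_and_policy(q, states, actions):
    """For each state, the value is the maximum Q-value and the policy is its argmax."""
    values, policy = {}, {}
    for s in states:
        best = max(actions, key=lambda a: q.get((s, a), 0.0))
        policy[s] = best
        values[s] = q.get((s, best), 0.0)
    return values, policy


# Toy Q-table with made-up numbers for two states.
toy_q = {
    ("S1", "Right"): 0.9, ("S1", "Left"): 0.1,
    ("S2", "Down"): 1.5, ("S2", "Right"): 1.2,
}
toy_values, toy_policy = state_values_and_policy(
    toy_q, states=["S1", "S2"], actions=["Left", "Right", "Up", "Down"]
)
```

In this toy table the greedy policy sends the agent Right from S1 and Down from S2, and the state values are 0.9 and 1.5.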
Exploration vs. number of agents: Fig. 6(a) shows the number of MDP time steps needed for full exploration of the 4x4 MDP grid shown in Fig. 5(a) for different values of the fixed learning rate, $\alpha$. We can see that the number of steps needed reduces drastically (from around 3000 to 80) as the number of agents increases, which shows the benefits of distributed Q-learning and of the availability of multiple bystanders at the incident zone. We can also notice that is the best as the agents learn faster and take fewer steps to reach convergence. We also note that in practice there is no need to wait until full convergence of the values, as the policy converges faster than the values. We also make the assumption that each time step in the Q-learning corresponds to about one second of wall-clock time. The decay in Boltzmann temperature $T$ (same for $\alpha$) is modeled as a radioactive decay, $T(t) = T_0 e^{-t/\tau}$, where $T_0$ is the initial value and $\tau$ is the decay time constant.
Time spent in fire states vs. and : Fig. 6(b) shows the percentage of time spent in fire states as is varied for different values of for the single-agent case with 180 time steps (3 minutes) (we have plotted the variation with different time steps in Fig. 7(c)). We know that is inversely proportional to the decay rate. Large values correspond to more exploration and less exploitation, and vice versa. We can observe that the percentage of time spent in fire states reduces as increases. This is because there is more exploration as increases, thus reducing the time spent in fire states. We observe a similar behavior in Fig. 6(c), which shows the percentage of time spent in fire states as is varied for different values of , also for a single agent with three minutes of allotted time. These results are averaged over 50 runs to account for the randomness present in the action selection that is based on the values of . The ratio of exploitation to exploration can initially be around 75:25. Later this can be adaptively changed based on fire propagation. For example, if fire is spreading more rapidly, we would want to change this ratio to 90:10. However, we note that a single agent is not able to reach 70% exploitation within the time allotted. Due to space constraints we have not included another figure, but the single agent is able to reach more than 75% exploitation when the time allotted is increased to five minutes. In order to show the benefits of multiple agents, we ran the same simulation with ten agents. The results are depicted in Fig. 7(a) and Fig. 7(b), which show that there is an improvement in the percentage of time spent in the fire states compared to the one-agent case. These results help us determine the values of the design parameters , based on the exploitation-exploration ratio preferences. For example, if we have ten agents and we prefer them to stay in fire regions for 75% of time (in 3-minute window), we set .
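The metric used in these plots, the percentage of time spent in fire states averaged over repeated runs, can be computed as in the Python sketch below. It assumes each run is recorded as the list of states an agent visited, one entry per time step. The function names and the toy trajectories are hypothetical.

```python
def fire_time_percentage(trajectory, fire_states):
    """Percentage of time steps in which the visited state is a fire (reward) state."""
    if not trajectory:
        return 0.0
    hits = sum(1 for s in trajectory if s in fire_states)
    return 100.0 * hits / len(trajectory)


def mean_over_runs(trajectories, fire_states):
    """Average the per-run percentage over several independent runs."""
    if not trajectories:
        return 0.0
    return sum(fire_time_percentage(t, fire_states) for t in trajectories) / len(trajectories)


FIRE = {"S2", "S3", "S6", "S7"}          # reward states of the 4x4 example
run_a = ["S0", "S2", "S3", "S3"]         # toy trajectory: 3 of 4 steps in fire states
run_b = ["S0", "S1", "S2", "S6"]         # toy trajectory: 2 of 4 steps in fire states
pct_a = fire_time_percentage(run_a, FIRE)
pct_mean = mean_over_runs([run_a, run_b], FIRE)
```

For the toy data, the first run spends 75% of its steps in fire states and the two-run average is 62.5%.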
Time spent in fire states vs. time allotted: Fig. 7(c) shows the percentage of time spent in fire states as the time allotted to the agents increases from one to ten minutes for different numbers of agents. We have set based on our previous observation. We can note that the percentage increases as both the time allotted and the number of agents increase. However, we note a saturation behavior in both, implying that there is a minimum amount of time needed for exploration.
Model adaptation to random fire occurrences: To evaluate our algorithm’s ability to adapt to dynamic environments, we evaluated our MARL framework as the fire position changes every three minutes in a random manner, as shown in Fig. 8. We can see that the system is able to adapt to change in fire positions by adaptively changing the percentage of time spent in fire states. We have enabled periodic exploration with resetting of parameters, i.e., Q-values to 0, .
IV Conclusions and Future Work
We presented Argus, a Multi-Agent Reinforcement Learning (MARL) framework (implemented as a mobile application plus a backend server) for cooperative data collection to build a 3D mapping of the disaster scene using agents present around the incident zone to facilitate the rescue operations. As future work, we will incorporate local processing to handle network failures and develop privacy-preserving data-collection solutions; also, we plan to evaluate our framework in a pilot study.
Acknowledgments: We thank the Department of Homeland Security Science & Technology Directorate (DHS S&T) Cyber Security Division for their support under contract No. D15PC00159.
- [1] M. Kyng, E. T. Nielsen, and M. Kristensen, “Challenges in Designing Interactive Systems for Emergency Response,” in Proc. of the Conference on Designing Interactive Systems, 2006.
- [2] A. Wharton, “Simulation and Investigation of Multi-Agent Reinforcement Learning for Building Evacuation Scenarios (Master Thesis).” www.robots.ox.ac.uk/~ash/4YP%20Report.pdf, 2009.
- [3] C. Schönauer, E. Vonach, G. Gerstweiler, and H. Kaufmann, “3D Building Reconstruction and Thermal Mapping in Fire Brigade Operations,” in Proc. of the Augmented Human International Conference, 2013.
- [4] E. Lee, H. Viswanathan, and D. Pompili, “RescueNet: Reinforcement-Learning-based Communication Framework for Emergency Networking,” Computer Networks (Elsevier), Apr. 2016.
- [5] S. Mehrotra, C. Butts, D. Kalashnikov, N. Venkatasubramanian, R. R. Rao, G. Chockalingam, R. Eguchi, B. Adams, and C. Huyck, “Project RESCUE: challenges in responding to the unexpected,” in Electronic Imaging 2004, pp. 179–192, 2003.
- [6] V. Balasubramanian, D. Massaguer, S. Mehrotra, and N. Venkatasubramanian, “DrillSim: A Simulation Framework for Emergency Response Drills,” in Intelligence and Security Informatics, Lecture Notes in Computer Science, pp. 237–248, Springer Berlin Heidelberg, 2006.
- [7] S. D. Ramchurn, F. Wu, W. Jiang, J. E. Fischer, S. Reece, S. Roberts, T. Rodden, C. Greenhalgh, and N. R. Jennings, “Human–agent collaboration for disaster response,” Autonomous Agents and Multi-Agent Systems, pp. 1–30, 2015.
- [8] L. Busoniu, R. Babuska, and B. De Schutter, “A Comprehensive Survey of Multiagent Reinforcement Learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, pp. 156–172, March 2008.
