Q-Cogni: An Integrated Causal Reinforcement Learning Framework

Cris Cunha; Wei Liu; Tim French; Ajmal Mian

arXiv:2302.13240·cs.LG·February 28, 2023


TL;DR

Q-Cogni introduces a causal reinforcement learning framework that integrates causal structure discovery into Q-Learning, enhancing learning efficiency, interpretability, and policy quality in complex environments like vehicle routing.

Contribution

It presents a novel causal reinforcement learning framework that combines causal inference with Q-Learning, enabling better decision-making and interpretability in high-dimensional problems.

Findings

01. Q-Cogni outperforms state-of-the-art reinforcement learning algorithms in VRP tasks.

02. In real-world taxi routing, Q-Cogni derives a route equal to or better than shortest-path search in 85% of cases.

03. The framework improves learning efficiency and interpretability.



Taxonomy

Topics: Transportation and Mobility Innovations · Auction Theory and Applications · Traffic control and management

Methods: Q-Learning

Full text


Cristiano da Costa Cunha¹, Wei Liu¹, Tim French¹, Ajmal Mian¹

¹Department of Computer Science, University of Western Australia

[email protected], {wei.liu, tim.french, ajmal.mian}@uwa.edu.au

Abstract

We present Q-Cogni, an algorithmically integrated causal reinforcement learning framework that redesigns Q-Learning with an autonomous causal structure discovery method to improve the learning process with causal inference. Q-Cogni achieves optimal learning with a pre-learned structural causal model of the environment that can be queried during the learning process to infer cause-and-effect relationships embedded in a state-action space. We leverage the sample-efficient techniques of reinforcement learning, enable reasoning about a broader set of policies and bring higher degrees of interpretability to decisions made by the reinforcement learning agent. We apply Q-Cogni to the Vehicle Routing Problem (VRP) and compare against state-of-the-art reinforcement learning algorithms. We report results that demonstrate better policies, improved learning efficiency and superior interpretability of the agent’s decision making. We also compare this approach with traditional shortest-path search algorithms and demonstrate the benefits of our causal reinforcement learning framework for high-dimensional problems. Finally, we apply Q-Cogni to derive optimal routing decisions for taxis in New York City using the Taxi & Limousine Commission trip record data and compare with shortest-path search, reporting that Q-Cogni derives an equal or better policy in 85% of the cases in a real-world domain.

1 Introduction

Evidence suggests that the human brain operates as a dual system. One system learns to repeat actions that lead to a reward, analogous to a model-free agent in reinforcement learning, the other learns a model of the environment which is used to plan actions, analogous to a model-based agent. These systems coexist in both cooperation and competition which allows the human brain to negotiate a balance between cognitively cheap but inaccurate model-free algorithms and relatively precise but expensive model-based algorithms Gershman (2017).

Despite the benefits causal inference can bring to autonomous learning agents, its degree of integration in artificial intelligence research is limited. This limitation becomes a risk, in particular because data-driven models are often used to infer causal effects. Relying solely on data, which is never bias-free, eventually leads to untrustworthy decisions and sub-optimal interventions Prosperi et al. (2020).

In this paper, we present Q-Cogni, a framework that integrates autonomous causal structure discovery and causal inference into a model-free reinforcement learning method. There are several emergent methods for integrating causality with reinforcement learning, such as reward correction Buesing et al. (2018), meta-reinforcement learning Dasgupta et al. (2019), latent causal-transition models Gasse et al. (2021), schema networks Kansky et al. (2017) and explainable agents Madumal et al. (2020). However, none of these methods embeds causal reasoning, from an autonomously derived causal structure of the environment, in the learning process of a reinforcement learning agent to guide the generation of an optimal policy. Our approach is therefore able to target improvements in policy quality, learning efficiency and interpretability concurrently.

Q-Cogni samples data from an environment to discover the causal structure describing the relationship between state transitions, actions and rewards. This causal structure is then used to construct a Bayesian Network which is used during a redesigned Q-Learning process where the agent interacts with the environment guided by the probability of achieving the goal and receiving rewards in a probabilistic manner. The causal structure integrated with the learning procedure delivers higher sample efficiency as it causally manages the trade-off between exploration and exploitation, is able to derive a broader set of policies as rewards are much less sparse and provides interpretability of the agent’s decision making in the form of conditional probabilities related to each state transition for a given set of actions.

We validate our approach on the Vehicle Routing Problem (VRP) Toth and Vigo (2002). We start by comparing optimal learning metrics against state-of-the-art reinforcement learning algorithms PPO Schulman et al. (2017) and DDQN Van Hasselt et al. (2016), using the Taxi-v3 environment from OpenAI Gym Brockman et al. (2016). We also compare the advantages and disadvantages of Q-Cogni against the shortest-path search algorithms Dijkstra’s Dijkstra (1959) and A* Hart et al. (1968), with a particular focus on understanding applicability and scalability. Finally, we run experiments on a real-world scale problem, using the New York City TLC trip record data, which contains all taxi movements in New York City from 2013 to date Taxi and Comission (2022), to validate Q-Cogni’s capabilities to autonomously route taxis for a given pickup and drop-off.

Our contributions with Q-Cogni are three-fold. Firstly, Q-Cogni is the first fully integrated, explainable, domain-agnostic and hybrid model-based and model-free reinforcement learning method that introduces autonomous causal structure discovery to derive an efficient model of the environment and uses that causal structure within the learning process. Secondly, we redesigned the Q-Learning algorithm to use causal inference in the action selection process and a probabilistic Q-function during training in order to optimise policy learning. Finally, through extensive experiments, we demonstrate Q-Cogni’s superior capability in achieving better policies, improved learning efficiency and interpretability as well as near-linear scalability to higher dimension problems in a real-world navigation context.

2 Background

The focus of this work lies at the unification of causal inference and reinforcement learning. This is an emerging field that aims to overcome challenges in reinforcement learning such as 1) the inability of agents to identify or react to novel circumstances they have not been programmed for Darwiche (2018); Chen and Liu (2018), 2) low levels of interpretability that erode users’ trust and do not promote ethical and unbiased systems Ribeiro et al. (2016); Marcus (2018) and 3) the lack of understanding of cause-and-effect relationships Pearl (2010).

Our approach builds upon a wealth of previous contributions to these areas, which we briefly cover below.

Causal Structure Discovery. Revealing causal information by analysing observational data, i.e. “causal structure discovery”, has been a significant area of recent research, as it avoids the time, resource and cost challenges of designing and running experiments Kuang et al. (2020).

Most of the work associated with integrating causal structure discovery and reinforcement learning has focused on using reinforcement learning to discover cause-and-effect relationships in the environments that agents interact with to learn Zhu et al. (2019); Wang et al. (2021); Huang et al. (2020); Amirinezhad et al. (2022); Sauter et al. (2022). To our knowledge, only a small amount of work has explored the reverse application, such as schema networks Kansky et al. (2017), counterfactual learning Lu et al. (2020) and causal MDPs Lu et al. (2022).

We build upon this work and redesign the way in which structural causal models (SCMs) are used. In the related work they are typically used to augment input data with what-if scenarios prior to the agent learning process. In our approach, the SCM is embedded as part of a redesigned Q-Learning algorithm and only used during the learning process. Our approach also enables learning a broader set of policies, since what-if scenarios are estimated for each state-action pair during the learning process. This not only improves policy optimality but also provides superior sample efficiency, as it allows for “shortcutting” the exploration step during the learning process.

Causal Inference. Recent work has demonstrated the benefits of integrating causal inference in reinforcement learning.

Seitzer et al. (2021) demonstrate an improvement in policy quality in a robotics control environment by deriving a measure that captures the causal influence of actions on the environment, and devise a practical method to integrate it into the exploration and learning of reinforcement learning agents.

Yang et al. (2022) propose an augmented DQN algorithm which receives interference labels during training as an intervention into the environment and embeds a latent state into its model, creating resilience by learning to handle abnormal events (e.g. frozen screens in Atari games).

Gasse et al. (2021) derive a framework to use a structural causal model as a Partially Observable Markov Decision Process (POMDP) in model-based reinforcement learning.

Building on these concepts, our approach further expands the structural causal model and fits it with a Bayesian Network. This enables our redesigned Q-Learning procedure to receive rewards as a function of the probability of achieving a goal for a given state-action pair, significantly improving the sample efficiency of the agent, as in each step the agent is able to concurrently derive dense rewards for several state transitions regulated by the causal structure. To our knowledge, this integration perspective has not yet been explored.

Model-Free Reinforcement Learning. Central to the reinforcement learning paradigm is the learning agent, which is the “actor” that learns the optimal sequence of actions for a given task, i.e. the optimal policy. As this policy is not known a priori, the aim is to develop an agent capable of learning it by interacting with the environment Kaelbling et al. (1996), an approach known as model-free reinforcement learning.

Model-free reinforcement learning relies on algorithms that sample from experience and estimate a utility function, such as SARSA, Q-Learning and Actor-Critic methods Arulkumaran et al. (2017). Recent advances in deep learning have promoted growth of model-free methods Çalışır and Pehlivanoğlu (2019). However, whilst model-free reinforcement learning is a promising area to enable human-level artificial intelligence, it comes with its own limitations: applicability restricted to a narrow set of assumptions (e.g. a Markov Decision Process) that is not necessarily reflective of the dynamics of the real-world environment John (2020); lower performance when evaluating off-policy decisions (i.e. policies different from those contained in the underlying data used by the agent) Bannon et al. (2020); and perpetual partial observability, since sensory data provide imperfect information about the environment and hidden variables are often the ones causally related to rewards Gershman (2017). These limitations can be overcome with explicit models of the environment, i.e. model-based reinforcement learning. However, whilst the model-based approach would enhance sample efficiency, it would also come at a cost of increased computational complexity, as many more samples are required to derive an accurate model of the environment Polydoros and Nalpantidis (2017).

In our work we use causal structure discovery to give model-free reinforcement learning agents the ability to deal with imperfect environments (e.g., latent variables) while maintaining the sample efficiency of a model-based approach. To our knowledge, this hybrid approach has not been extensively explored.

3 Q-Cogni

We present Q-Cogni, a framework that integrates autonomous causal structure discovery, causal inference and reinforcement learning. Figure 1 illustrates the modules and interfaces which we further detail below.

3.1 Autonomous Causal Structure Discovery

The first module in Q-Cogni is designed to autonomously discover the causal structure contained in an environment. It starts by applying a random walk in the environment while storing the states, actions and rewards. The number of steps required to visit every state in the environment with a random walk is proportional to the harmonic number, which is approximated by the natural logarithm and therefore grows without bound, albeit slowly, demonstrating the efficiency of the process. This sampled dataset contains all the information necessary to describe the full state-action space and its associated transitions. A further benefit of our approach is that this step only needs to be performed once, regardless of the environment configuration.
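The sampling step can be sketched as follows. This is a minimal illustration on a hypothetical 3 × 3 grid world, not the paper's environment: the grid size, goal cell, reward values, reset rule and function names are all assumptions made for the example.

```python
import random

# Hypothetical 3x3 grid world used only for illustration; the goal cell,
# rewards and action names are assumptions, not the paper's environment.
ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
SIZE, GOAL = 3, (2, 2)

def step(state, action):
    """Apply an action, clipping at the walls; return (next_state, reward)."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    next_state = (row, col)
    reward = 20 if next_state == GOAL else -1
    return next_state, reward

def random_walk(n_steps, seed=0):
    """Collect (state, action, reward, next_state) tuples with uniform random actions."""
    rng = random.Random(seed)
    state, samples = (0, 0), []
    for _ in range(n_steps):
        action = rng.choice(list(ACTIONS))
        next_state, reward = step(state, action)
        samples.append((state, action, reward, next_state))
        # Restart from the origin after reaching the goal so the walk keeps sampling.
        state = (0, 0) if next_state == GOAL else next_state
    return samples

samples = random_walk(5000)
visited = {s for s, _, _, _ in samples}
print(len(samples), len(visited))
```

The resulting list of tuples plays the role of the sampled dataset that is later handed to structure discovery; in practice each tuple would be flattened into one row of named state variables, action indicators and reward.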

We use the NOTEARS algorithm Zheng et al. (2018) in the Q-Cogni framework, providing an efficient method to derive the causal structure encoded in the dataset sampled from the environment.
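For reference, NOTEARS as formulated by Zheng et al. (2018) for linear models casts structure learning as a continuous optimisation problem over a weighted adjacency matrix $W$, with acyclicity enforced by a smooth equality constraint. Here $X$ is the $n \times d$ data matrix, $\lambda$ is a sparsity weight and $\circ$ is the element-wise product; the source does not state which variant or hyperparameter values were used in Q-Cogni.

```latex
\min_{W \in \mathbb{R}^{d \times d}} \; \frac{1}{2n}\lVert X - XW \rVert_F^2 + \lambda \lVert W \rVert_1
\quad \text{subject to} \quad
h(W) = \operatorname{tr}\!\left(e^{W \circ W}\right) - d = 0
```

The constraint $h(W) = 0$ holds exactly when the graph encoded by $W$ has no directed cycles, which is what allows the learned structure to be used as a DAG.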

The resulting structure learned is then encoded as a DAG $G$ with nodes $v \in G$ (state variables $x$ and actions $a$) and edges $e \in G$ which represent the state transition probabilities. With a maximum likelihood estimation procedure, the discovered structure is then fitted with the sampled dataset to estimate the conditional probability distributions of the graph and encode it as a Bayesian Network.
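For discrete variables, maximum likelihood estimation of a conditional probability distribution reduces to counting. The sketch below illustrates this for a single node given its parents; the variable names and the tiny dataset are hypothetical, and a real Bayesian Network library would repeat this for every node in the DAG.

```python
from collections import Counter, defaultdict

def fit_cpt(samples, child, parents):
    """Maximum likelihood estimate of P(child | parents) from a list of dict samples."""
    joint = Counter()      # counts of (parent values, child value)
    marginal = Counter()   # counts of parent values alone
    for row in samples:
        key = tuple(row[p] for p in parents)
        joint[(key, row[child])] += 1
        marginal[key] += 1
    cpt = defaultdict(dict)
    for (key, value), count in joint.items():
        cpt[key][value] = count / marginal[key]
    return dict(cpt)

# Hypothetical samples; the variable names are illustrative only.
samples = [
    {"taxi_on_pax": True, "pickup": True, "pax_in_taxi": True},
    {"taxi_on_pax": True, "pickup": True, "pax_in_taxi": True},
    {"taxi_on_pax": True, "pickup": False, "pax_in_taxi": False},
    {"taxi_on_pax": False, "pickup": True, "pax_in_taxi": False},
    {"taxi_on_pax": False, "pickup": False, "pax_in_taxi": False},
]
cpt = fit_cpt(samples, child="pax_in_taxi", parents=["taxi_on_pax", "pickup"])
print(cpt[(True, True)])  # {True: 1.0}
```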

Whilst this module focuses on autonomous causal structure learning, Q-Cogni provides the flexibility to receive human inputs in the form of tabu nodes and edges, i.e. constraints that a human expert can introduce in the causal structure discovery procedure. This capability allows domain knowledge to be integrated with a data-driven approach, providing a superior model in comparison to using either in isolation.
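To make the idea of tabu constraints concrete, the sketch below filters a list of candidate edges against expert-supplied restrictions. It is a simplification: the helper `apply_tabu`, its parameter names and the example edges are hypothetical, and structure learning libraries typically enforce such constraints during optimisation rather than as a post-hoc filter.

```python
def apply_tabu(edges, tabu_edges=(), tabu_parent_nodes=(), tabu_child_nodes=()):
    """Drop candidate edges that violate expert constraints.

    edges: iterable of (parent, child, weight) tuples.
    tabu_edges: (parent, child) pairs that must not appear.
    tabu_parent_nodes: nodes that may not be a parent of any node.
    tabu_child_nodes: nodes that may not be a child of any node.
    """
    banned = set(tabu_edges)
    return [
        (parent, child, weight)
        for parent, child, weight in edges
        if (parent, child) not in banned
        and parent not in tabu_parent_nodes
        and child not in tabu_child_nodes
    ]

# Hypothetical candidate edges with illustrative weights.
candidates = [
    ("drop_off", "pax_in_taxi", 0.4),
    ("pickup", "pax_in_taxi", 0.9),
    ("pax_in_taxi", "taxi_row", 0.2),
]
kept = apply_tabu(
    candidates,
    tabu_edges=[("drop_off", "pax_in_taxi")],
    tabu_child_nodes=["taxi_row"],
)
print(kept)  # [('pickup', 'pax_in_taxi', 0.9)]
```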

3.2 Causal Inference

We use the discovered causal structure model and the estimated Bayesian Network of the environment to provide Q-Cogni’s reinforcement learning module with causal inference capabilities.

The causal inference sub-module uses the causal DAG $G(V,E)$ and receives from the Q-Cogni agent a single state $s\in S$ containing each state variable $x\in s$ with values sampled from the environment $M$, the actions list $A$ containing each action $a\in A$ and the first-priority sub-goal $o$ to be solved. The marginals in each node $v\in V\cap s$ are updated with the state variable values $x$, $\forall x\in s$. The procedure described in Algorithm 1 selects the best action $a^{*}$ and calculates the associated $P(o={\rm True}|x,A)$ where $a^{*}\in A$. This is analogous to a probabilistic reward estimation $r$ for the given $(s,a^{*})$ pair.

This module enables the agent’s gains in learning efficiency, as it shortcuts the reinforcement learning exploration procedure through the structural prior knowledge of the conditional probability distributions of $(s,a)$ pairs encoded in the DAG. It also provides explicit interpretability capabilities to the Q-Cogni reinforcement learning module by being able to estimate $P(o={\rm True}|x,A)$.
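Algorithm 1 is not reproduced in this text, so the following is only a sketch of the selection step as described in prose: query the probability of achieving the sub-goal for each candidate action in the current state and keep the maximiser. The function `select_action`, the lookup table `CPT` and its probabilities are hypothetical stand-ins for a query against the fitted Bayesian Network.

```python
def select_action(prob_goal, state, actions):
    """Return (a*, P(o=True | x, a*)) by scanning the available actions.

    prob_goal: callable (state, action) -> probability that the sub-goal is achieved.
    """
    best_action = max(actions, key=lambda a: prob_goal(state, a))
    return best_action, prob_goal(state, best_action)

# Hypothetical conditional probabilities for the sub-goal "pax in taxi".
CPT = {(True, "pickup"): 0.98, (False, "pickup"): 0.01}

def prob_pax_in_taxi(state, action):
    """Look up P(pax in taxi = True | taxi_on_pax, action); unseen pairs default to 0."""
    return CPT.get((state["taxi_on_pax"], action), 0.0)

ACTIONS = ["north", "south", "east", "west", "pickup", "dropoff"]
print(select_action(prob_pax_in_taxi, {"taxi_on_pax": True}, ACTIONS))  # ('pickup', 0.98)
```

The returned probability is what makes each decision inspectable: it can be logged alongside the chosen action at every step.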

3.3 Modified Q-Learning

The modified Q-Learning module uses a hybrid learning procedure that combines Q-Cogni’s causal inference module with an $\epsilon$-decay exploration strategy. In addition, one central idea in Q-Cogni is to use the inherent and known structure of the reinforcement learning sub-goals for a given task to reduce the problem dimensionality. Whilst this is not a strict requirement of the approach, when available a priori it gives a significant advantage in computational efficiency for the learning procedure, as Q-Cogni shrinks the state-action space to the subset that matters for a given sub-goal. Q-Cogni’s learning procedure can receive a prioritised list $O$ of ordered reinforcement learning sub-goals $o$ and uses that information to choose when to “explore vs. infer”. If such a goal sequence is not known a priori, the benefits of our approach still hold, albeit with a lower sample efficiency that is still superior to that of a traditional agent, which would need to balance exploration vs. exploitation.

To achieve that, for the prioritised sub-goal $o$, Q-Cogni assesses whether $\max P(o={\rm True}|x,A)$ is attained by an action $a^{*}\in A$ that is a parent node in $V$ of the sub-goal node $o$. In this case, the agent selects $a^{*}$ and applies it directly in the environment to obtain the reward $r$, which is adjusted by $P(o={\rm True}|x,A)$ during the value function update procedure, a step taken to avoid reward sparsity and improve learning performance. The Q-table stores this result, providing a more robust estimation of value than unadjusted rewards without having to perform wide exploration in the environment. Otherwise, the Q-Cogni agent performs the $\epsilon$-decay exploration procedure. Algorithm 2 describes the modified Q-Learning routine in Q-Cogni.

This routine enables optimised learning. Policies are improved by the upfront knowledge acquired with the causal structure module and the representation of the state transition outcomes. Unnecessary exploration of state-action pairs that do not improve the probability of achieving the learning goal is eliminated, thus improving learning efficiency.
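Algorithm 2 is not reproduced in this text, so the sketch below only illustrates the "explore vs. infer" idea under stated assumptions. It assumes that the reward is adjusted by multiplying it with the inferred probability, that inference is used only when that probability reaches a hypothetical threshold `min_prob`, and that $\epsilon$ decays geometrically; none of these specifics, nor the function names and numeric values, come from the source.

```python
import random
from collections import defaultdict

def choose_action(Q, state, actions, prob_goal, parent_actions, epsilon, rng, min_prob=0.5):
    """Infer when the most probable action is a parent of the sub-goal node; otherwise explore."""
    best = max(actions, key=lambda a: prob_goal(state, a))
    p = prob_goal(state, best)
    if best in parent_actions and p >= min_prob:  # min_prob is an assumed threshold
        return best, p                             # causal inference branch
    if rng.random() < epsilon:
        return rng.choice(actions), None           # random exploration
    return max(actions, key=lambda a: Q[(state, a)]), None  # greedy exploitation

def update(Q, state, action, reward, next_state, actions, p=None, alpha=0.1, gamma=0.9):
    """Tabular Q-Learning update; on the inference branch the reward is scaled by p (assumption)."""
    adjusted = reward * p if p is not None else reward
    target = adjusted + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])

def decay(epsilon, rate=0.995, floor=0.01):
    """Geometric epsilon decay with a lower bound (assumed schedule)."""
    return max(floor, epsilon * rate)

# One illustrative step with hypothetical states, probabilities and reward.
ACTIONS = ["north", "south", "east", "west", "pickup", "dropoff"]
Q = defaultdict(float)
rng = random.Random(0)

def prob(state, action):
    return 0.5 if (state == "on_pax" and action == "pickup") else 0.0

action, p = choose_action(Q, "on_pax", ACTIONS, prob, {"pickup"}, epsilon=1.0, rng=rng)
update(Q, "on_pax", action, reward=20, next_state="carrying", actions=ACTIONS, p=p)
print(action, p, Q[("on_pax", action)])  # pickup 0.5 1.0
```

In this step the adjusted reward is 20 × 0.5 = 10, the bootstrap term is zero because the table is empty, and the learning rate of 0.1 moves the stored value from 0 to 1.0.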

4 Approach Validation

To validate our approach, we start with the Vehicle Routing Problem (VRP). Here, we formally define the VRP and briefly discuss the traditional solutions on which we build.

VRP. In our work, inspired by the emergence of self-driving cars, we use a variant of the VRP, where goods need to be picked up from a certain location and dropped off at their destination. The pick-up and drop-off must be done by the same vehicle, which is why the pick-up location and drop-off location must be included in the same route Braekers et al. (2016).

This variant is known as the VRP with Pickup and Delivery, an NP-hard problem extensively studied by the operations research community given its importance to the logistics industry. The objective is to find the least-cost tour, i.e. the shortest route, to fulfill the pickup and drop-off requirements Ballesteros Silva and Escobar Zuluaga (2016).

Shortest-Path Search Methods. The shortest-path search problem is one of the most fundamental problems in combinatorial optimisation. As a minimum, to solve most combinatorial optimisation problems either shortest-path search computations are called as part of the solving procedure or concepts from the framework are used Gallo and Pallottino (1986). Similarly, it is natural to solve the VRP with Pickup and Delivery with shortest-path search methods.

Despite successful shortest-path search algorithms such as Dijkstra’s and A*, the VRP, as an NP-hard problem, can be very challenging for these methods. Exact algorithms like Dijkstra’s can be computationally intractable depending on the scale of the problem Drori et al. (2020); approximate algorithms like A* provide only worst-case guarantees and are not scalable Williamson and Shmoys (2011). Reinforcement learning is an appealing direction for such problems as it provides a generalisable, sample-efficient and heuristic-free method to overcome the characteristic computational intractability of NP-hard problems.

5 Experimental Results

We start with the Taxi-v3 environment from OpenAI Gym Brockman et al. (2016), a software implementation of an instance of the VRP with Pickup and Delivery. The environment was first introduced by Dietterich Dietterich (2000) to illustrate challenges in hierarchical reinforcement learning.

Figure 2 illustrates the Taxi-v3 environment with the example of a solution using the Q-Learning algorithm. The 5 $\times$ 5 grid has four possible initial locations for the passenger and destination indicated by R(ed), G(reen), Y(ellow), and B(lue). The objective is to pick up the passenger at one location and drop them off in another. The agent receives a reward of +20 points for a successful drop-off and a reward of -1 for every movement. There are 404 reachable discrete states and six possible actions by the agent (move west, move east, move north, move south, pickup and deliver).
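The figure of 404 reachable states can be checked by enumeration. The sketch below follows the state encoding documented for the Gym Taxi-v3 environment (25 taxi positions, 5 passenger locations including "in taxi", 4 destinations); the helper names are ours.

```python
from itertools import product

def encode(taxi_row, taxi_col, passenger, destination):
    """Taxi-v3 style index: passenger in 0..4 (4 = in taxi), destination in 0..3."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination

def decode(index):
    """Invert encode(), returning (taxi_row, taxi_col, passenger, destination)."""
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger, destination

LANDMARKS = [(0, 0), (0, 4), (4, 0), (4, 3)]  # grid cells of R, G, Y, B

all_states = list(product(range(5), range(5), range(5), range(4)))
# During an episode the waiting passenger is never already at the destination.
in_episode = [s for s in all_states if s[2] != s[3]]
# Terminal states: passenger delivered and taxi standing on the destination landmark.
terminal = [s for s in all_states if s[2] == s[3] and (s[0], s[1]) == LANDMARKS[s[3]]]

print(len(all_states), len(in_episode), len(terminal), len(in_episode) + len(terminal))
# 500 400 4 404
```

Of the 500 encodable states, 400 can occur during an episode and 4 more occur only immediately after a successful drop-off, giving the 404 quoted above.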

All experiments were performed on a p4d.24xlarge GPU enabled AWS EC2 instance.

5.1 Optimal Learning

We trained Q-Cogni, Q-Learning, DDQN and PPO algorithms for 1,000 episodes in the Taxi-v3 environment. DDQN and PPO were implemented using the RLlib Python package Liang et al. (2018). Hyperparameters for DDQN and PPO were tuned using the BayesOptSearch module, using 100 trials over 1,000 episodes each.

5.1.1 Results and Discussion.

Figure 3 illustrates the autonomously discovered structure for the Taxi-v3 environment using Q-Cogni, after 500,000 samples collected with a random walk. We used the implementation of the NOTEARS algorithm in the CausalNex Python package Beaumont et al. (2021) to construct the causal structure model and fit the conditional probability distributions through a Bayesian Network. The relationships discovered are quite intuitive, demonstrating the high performance of the method. For example, for the node passenger in taxi to be True, the nodes taxi on passenger location and pickup action must be True.

The only domain-related inputs given to the algorithm were constraints: the sub-goal 1 node pax in taxi cannot be a child of the sub-goal 2 node drop-off, and location nodes must be parent nodes. In addition, the list of ordered sub-goals was provided to Q-Cogni’s reinforcement learning module as [pax in taxi, drop-off].

Figure 4 shows the results achieved over the 1,000 training episodes. We observe that all methods present similar policy performance (total reward per episode towards the end of training). However, Q-Cogni achieves superior stability and learning efficiency in comparison to all other methods, as it is able to use the causal structure model and its causal inference capability to accelerate the action selection process when interacting with the environment. In addition, Figure 5 demonstrates the interpretability capabilities of Q-Cogni. At each step, a probability of the best action to be taken is provided allowing for better diagnostics, tuning and most importantly assessment of possible biases built into autonomous agents such as Q-Cogni.

5.2 Comparison to Shortest-Path Search Methods

We analyse the characteristics of our approach against shortest-path search methods to highlight the advantages and disadvantages of Q-Cogni. We perform experiments in which we expand the Taxi-v3 environment into larger state spaces, represented by a grid of n $\times$ m rows and columns. We then compare the time taken to achieve an optimal tour against Dijkstra’s algorithm and A* using a Manhattan distance heuristic.
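For readers unfamiliar with the baselines, the sketch below shows Dijkstra's algorithm and A* with a Manhattan distance heuristic on an open grid with unit move costs. It is an illustrative implementation, not the one used in the experiments: the function name, the absence of walls and the expansion counter are assumptions. A pickup-and-delivery tour could be composed from two such searches (taxi to passenger, then passenger to destination).

```python
import heapq

def grid_search(n_rows, n_cols, start, goal, use_heuristic=False):
    """Shortest path length on an open grid with unit move cost.

    With use_heuristic=False this is Dijkstra's algorithm; with True it is A*
    guided by the Manhattan distance to the goal. Returns (cost, cells_expanded).
    """
    def h(cell):
        if not use_heuristic:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}
    frontier = [(h(start), 0, start)]  # entries are (priority, cost so far, cell)
    expanded = 0
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cost > best.get(cell, float("inf")):
            continue  # stale queue entry superseded by a cheaper path
        expanded += 1
        if cell == goal:
            return cost, expanded
        row, col = cell
        for nxt in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            if 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols:
                new_cost = cost + 1
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None, expanded

dijkstra = grid_search(8, 8, (0, 0), (7, 7))
a_star = grid_search(8, 8, (0, 0), (7, 7), use_heuristic=True)
print(dijkstra, a_star)
```

Both calls return the same path length; they differ only in how many cells are expanded before the goal is reached, which is the quantity that grows with grid size.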

5.2.1 Results and Discussion.

We report our comparison analysis across dimensionality, prior knowledge requirements, transportability between configurations and interpretability.

Scalability. Q-Cogni excels at large networks. Figure 6 shows the average time taken to identify the optimal tour for varying grid sizes, each representing a different scale of the Taxi-v3 environment. We performed experiments for grid sizes from 8 $\times$ 8 to 512 $\times$ 512 and fitted best-fit curves to extrapolate the increase in problem dimension.

We can observe that Q-Cogni takes orders of magnitude longer to identify the optimal tour for a low number of nodes. As the number of nodes increases, Q-Cogni becomes much more efficient. This is a product of the sample efficiency delivered within the Q-Cogni framework, where the causal component enables “shortcutting” of exploration requirements, thus reducing the need to proportionally increase the observations required by the agent.

A-priori knowledge requirement. Q-Cogni requires no prior knowledge of the map. Shortest-path methods require prior knowledge of the graph structure to be effectively applied. For example, in our taxi problem, both Dijkstra’s and A* require the map upfront. Q-Cogni requires the causal structure encoded as a graph, but does not require the map itself. This is a significant advantage for real-world application, as a-priori knowledge can be limited in navigation applications.

Transferability. If the configuration within the map changes (e.g. the initial passenger position), Q-Cogni does not need to be retrained. The same agent trained in one configuration of a map can be deployed to another configuration seamlessly. In contrast, whenever the configuration changes, the shortest-path search algorithms need to be rerun. Therefore, Q-Cogni has a significant advantage in dynamic settings, a common characteristic of real-world problems.

Interpretability. Shortest-path search methods offer limited interpretability of the decisions made by the algorithm to derive the optimal tour. The reasons why a particular edge is preferred over another are not explicitly described as part of their output. Q-Cogni is able to provide not only a full history of the reasons for which each decision was made but also the causes and associated probabilities of alternative outcomes at each step. This is another significant advantage for the applicability of Q-Cogni to real-world problems in which there is an interface between humans and the agent.

5.3 Real-World Application: Routing New York City Taxis

We use the New York City Taxi & Limousine Commission trip record data, which contains all taxi movements in New York City from 2013 to date Taxi and Comission (2022), to validate the applicability of Q-Cogni in a real-world context. Figure 7 shows all pickup and drop-off locations of yellow cabs on the 15th of October 2022. We see the highest density of taxi trips being in Manhattan, represented on the left hand side of Figure 7. However, we choose the neighborhoods between Long Island City and Astoria, represented in the highlighted area of Figure 7 as they have a more challenging street configuration than Manhattan.

We used the OSMnx library Boeing (2017) to convert the street map into a graph representation where intersections are nodes and streets are edges. We created a custom OpenAI Gym Brockman et al. (2016) environment to enable fitting of the Bayesian Network and training of Q-Cogni. The resulting graph, shown in Figure 8, contains 666 nodes and 1712 edges, resulting in a state-action space of size 443,556, a problem $10^{3}$ times larger than our Taxi-v3 environment.

We use the causal structure derived in Figure 3 and perform a random walk with 1,000,000 steps in the custom-built environment to fit a Bayesian Network to the causal model. It is important to appreciate here the transferability and domain-agnostic character of our approach, where we reuse the structure previously discovered for the same task on a completely different map.

We train Q-Cogni once for 100,000 episodes and evaluate the trained agent against several trips contained in the original dataset, without retraining each time the trip configuration changes in the real-world dataset. This is a significant benefit of the approach compared to shortest-path search methods. We also compare Q-Cogni results against Q-Learning to observe the effects of the causal model on policy quality, and against Dijkstra’s algorithm to observe the effects on policy efficiency.

First, Figure 9 shows the optimal routes generated by Q-Learning and Q-Cogni after 100,000 training episodes for a selected route. We can see that Q-Cogni significantly improves the policy generation, reaching a near-optimal shortest-path result post training, whereas Q-Learning fails to detect the right pickup point and performs multiple loops to get to the destination. Across the 615 trips evaluated, only 12% of the routes generated by Q-Learning had the same travel distance as the Q-Cogni generated routes, with the remainder being longer. These results show the benefits of the causal module of Q-Cogni towards optimal learning.

In addition, Figure 10 shows a sample comparison of optimal routes generated with Dijkstra’s algorithm and Q-Cogni. We observe on the left picture that Q-Cogni generates a more efficient route (in red) than Dijkstra’s (in blue) measured as the total distance travelled. On the right picture Q-Cogni generates a significantly different route which is slightly worse, albeit close in terms of distance. Overall, across the 615 trips evaluated we report 28% where Q-Cogni generated a shorter route than Dijkstra’s, 57% were the same and 15% worse.
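As a quick consistency check, the shares reported above add up to the headline figure quoted in the abstract, where "equal or better" combines the shorter and the identical routes:

```latex
\underbrace{28\%}_{\text{shorter}} + \underbrace{57\%}_{\text{same}} = 85\% \ \text{(equal or better)},
\qquad
85\% + \underbrace{15\%}_{\text{worse}} = 100\%
```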

These results show the applicability of Q-Cogni to a real-world case with demonstrated (i) transferability: the causal network obtained in the Taxi-v3 environment can be used for the same task; (ii) no prior knowledge required: Q-Cogni does not need to have access to the global map; and (iii) explainability and expandability: the Bayesian Network can be expanded to incorporate other causal relations such as traffic and weather. The application of Q-Cogni to this real-world dataset demonstrates a promising framework for bringing together causal inference and reinforcement learning to solve relevant and challenging problems.

6 Conclusion

We have presented Q-Cogni, a novel causal reinforcement learning framework that redesigns Q-Learning with an autonomous causal structure discovery method and causal inference as a hybrid model-based and model-free approach.

We have implemented a framework that leverages upon a data-driven causal structure model discovered autonomously (but flexible enough to accommodate domain knowledge based inputs) and redesigned the Q-Learning algorithm to apply causal inference during the learning process in a reinforcement learning setting.

Our approach exploits the causal structural knowledge contained in a reinforcement learning environment to shortcut the agent’s exploration of the state space and derive a more robust policy with fewer training requirements, as it increases the sample efficiency of the learning process. Together, these techniques are shown to achieve a superior policy, substantially improve learning efficiency, provide superior interpretability, scale efficiently to higher problem dimensions and generalise better to varying problem configurations. While these benefits have been illustrated in the context of one specific application (the VRP in the navigation domain), the approach can be applied to any reinforcement learning problem that contains an environment with an implicit representation of the causal relationships between state variables and some level of prior knowledge of the environment dynamics.

We believe that the integration of causality and reinforcement learning will continue to be an attractive area towards human level intelligence for autonomous learning agents. One promising avenue of research is to broaden the integrated approach to continuous state-action spaces such as control environments, a current focus of our research.

Bibliography


Amirinezhad et al. [2022] Amir Amirinezhad, Saber Salehkaleybar, and Matin Hashemi. Active learning of causal structures with deep reinforcement learning. Neural Networks, 2022.
Arulkumaran et al. [2017] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
Ballesteros Silva and Escobar Zuluaga [2016] Pedro Pablo Ballesteros Silva and Antonio Escobar Zuluaga. Review of state of the art vehicle routing problem with pickup and delivery (VRPPD). Ingeniería y Desarrollo, 34(2):463–482, 2016.
Bannon et al. [2020] James Bannon, Brad Windsor, Wenbo Song, and Tao Li. Causality and batch reinforcement learning: Complementary approaches to planning in unknown domains. arXiv preprint arXiv:2006.02579, 2020.
Beaumont et al. [2021] Paul Beaumont, Ben Horsburgh, Philip Pilgerstorfer, Angel Droth, Richard Oentaryo, Steven Ler, Hiep Nguyen, Gabriel Azevedo Ferreira, Zain Patel, and Wesley Leong. CausalNex, October 2021.
Boeing [2017] Geoff Boeing. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65:126–139, 2017.
Braekers et al. [2016] Kris Braekers, Katrien Ramaekers, and Inneke Van Nieuwenhuyse. The vehicle routing problem: State of the art classification and review. Computers & Industrial Engineering, 99:300–313, 2016.
Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.