New Subgraph Isomorphism Algorithms: Vertex versus Path-at-a-time Matching

Mosab Hassaan; Karam Gouda

arXiv:1904.08819·cs.DS·April 19, 2019

TL;DR

This paper introduces two novel algorithms, Fast-ON and Fast-P, for subgraph isomorphism that significantly outperform existing methods by applying vertex- and path-at-a-time matching strategies with effective heuristics.

Contribution

The paper presents two new algorithms for subgraph isomorphism, utilizing vertex- and path-at-a-time matching approaches with heuristics, achieving substantial speed improvements.

Findings

1. Fast-ON and Fast-P outperform existing algorithms by 1-4 orders of magnitude.

2. Both algorithms are effective for dense and sparse graphs.

3. The heuristics significantly reduce the search space.

Abstract

Graphs are widely used to model complicated data semantics in many application domains. In this paper, two novel and efficient algorithms, Fast-ON and Fast-P, are proposed for solving the subgraph isomorphism problem. Both algorithms are based on the Ullman algorithm [Ullmann 1976]; they apply vertex-at-a-time and path-at-a-time matching, respectively, and use effective heuristics to cut the search space. Compared with well-known algorithms, Fast-ON and Fast-P achieve speed-ups of one to four orders of magnitude on both dense and sparse graph data.

Tables

	$u_{1}$	$u_{2}$	$u_{3}$	$u_{4}$
$f_{1}$	$v_{1}$	$v_{2}$	$v_{4}$	$v_{3}$
$f_{2}$	$v_{2}$	$v_{1}$	$v_{3}$	$v_{4}$

	$u_{1}$	$u_{2}$	$u_{3}$
$f_{1}$	$v_{1}$	$v_{3}$	$v_{4}$
$f_{2}$	$v_{1}$	$v_{4}$	$v_{3}$
$f_{3}$	$v_{2}$	$v_{4}$	$v_{5}$
$f_{4}$	$v_{2}$	$v_{5}$	$v_{4}$

Equations

$|E_{q}|/|V_{q}|<maxL$

$d_{q}<2\cdot maxL/(|V_{q}|-1)$

Mosab Hassaan

Karam Gouda

Faculty of Science, Benha University, Egypt

Faculty of Computers & Informatics, Benha University, Egypt

keywords:

Subgraph isomorphism, vertex-at-a-time matching, path-at-a-time matching

1 Introduction

As a popular data structure, graphs have been used to model many complex data objects and their relationships in the real world, such as chemical compounds [Willett (1998)], entities in images [Petrakis and Faloutsos (1997)], and social networks [Cai et al. (2005)]. For example, in a social network, a person $i$ corresponds to a vertex $v_{i}$ in the graph $G$, and another person $j$ corresponds to a vertex $v_{j}$. If persons $i$ and $j$ are acquaintances or have a business relation, then an edge $(v_{i},v_{j})$ connects $v_{i}$ and $v_{j}$. Similarly, in chemistry, a set of atoms combined with designated bonds is used to describe a chemical molecule.

Subgraph isomorphism is an important and very general form of pattern matching that finds practical applications in areas such as pattern recognition and computer vision, computer-aided design, image processing, graph grammars, graph transformation, biocomputing, search in databases of chemical structural formulae, and numerous others. Moreover, subgraph isomorphism checking is a basic operation in managing and analyzing graph data; it is the building block of many graph analysis and management activities. For example, in Frequent Subgraph Mining, a well-studied problem in graph data analysis, the objective is to extract all subgraphs that occur in at least a specified number of graphs from a given set of data graphs. One main challenge in frequent subgraph mining is to count how many data graphs contain each candidate subgraph, which requires subgraph isomorphism checking between the candidates and each data graph. Another example is the well-known Subgraph Search, an important problem in graph data management. The objective of subgraph search is to retrieve the data graphs that contain a query graph as a subgraph. Subgraph isomorphism checking plays an important role in any solution to this problem.

Informally, two graphs $H$ and $G$ are isomorphic if it is possible to redraw one of them, say $G$, so that it appears identical to $H$. In other words, the graph isomorphism problem asks whether there is a one-to-one mapping between the vertices of the two graphs that preserves vertex connections (the edges). The subgraph isomorphism problem, on the other hand, asks the following question: given two graphs $H$ and $G$, is $H$ isomorphic to any subgraph of $G$? Graph isomorphism is neither known to be solvable in polynomial time nor known to be NP-complete, while subgraph isomorphism is known to be NP-complete [Garey and Johnson (1990)].

Contribution. In this paper, we propose two new algorithms for subgraph isomorphism checking. Both are based on Ullman algorithm and improve upon it by reducing its search space. The first algorithm, called Fast-ON, reduces the search space by utilizing the label information of each vertex’s neighborhood, and speeds up the search by following a novel ordering strategy for the query’s vertices. Compared with the well-known algorithms Ullman [Ullmann (1976)] and Vflib [Cordella et al. (2004)], Fast-ON achieves speed-ups of one to three orders of magnitude.

The second algorithm explores the possibility of leveraging substructure matching instead of vertex matching. Substructure matching cuts down the depth of the search tree and reduces the size of the search space, since the sets of matching candidates shrink accordingly. This new algorithm follows a path-at-a-time matching manner and is called Fast-P. To speed up the search in Fast-P, we propose an ordering of the query paths that forces false mappings to be discarded as early as possible during the search. Compared with Ullman [Ullmann (1976)] and Vflib [Cordella et al. (2004)], Fast-P achieves speed-ups of one to four orders of magnitude.

Organization. This paper is organized as follows. Section 2 introduces the preliminary concepts. Section 3 presents the related work. Section 4 presents our two new algorithms (Fast-ON and Fast-P). Section 5 reports the experimental results. Finally, Section 6 concludes the paper.

2 Preliminaries

In this section, we introduce the fundamental concepts. Let $\Sigma$ be a set of discrete-valued labels. A labeled graph is a 3-tuple $G=(V_{G},E_{G},l_{G})$, where $V_{G}$ is a set of vertices (each $v\in V_{G}$ is a unique ID representing a vertex), $E_{G}\subseteq V_{G}\times V_{G}$ is a set of edges (directed or undirected), and $l_{G}:V_{G}\cup E_{G}\longmapsto\Sigma$ is a function assigning labels to the vertices and edges of the graph. A labeled graph $G$ is said to be connected if each pair of vertices $v_{i},v_{j}\in V_{G}$, $i\neq j$, is directly or indirectly connected. This paper focuses on undirected, simple (no self-loops, no duplicate edges), labeled, and connected graphs. Given a graph $G$, we define the set of adjacent vertices (or neighbors) of a vertex $v\in V_{G}$ as $adj_{G}(v)=\{u:(v,u)\in E_{G}\}$, and the degree of $v$ as $deg_{G}(v)=|adj_{G}(v)|$. The size of $G$ is denoted by $|G|=|E_{G}|$. In what follows, a labeled graph is simply called a graph unless stated otherwise.

Definition 2.1.

Labeled Paths. A path $p=u\rightsquigarrow u^{\prime}$ from a vertex $u$ to a vertex $u^{\prime}$ in a labeled graph $G$ is a sequence $v_{0},v_{1},\ldots,v_{k}$ of vertices such that $u=v_{0}$, $u^{\prime}=v_{k}$, and $(v_{i-1},v_{i})\in E_{G}$ $\forall i=1\ldots k$. In other words, it is a sequence of edges connecting two vertices $u\in V_{G}$ and $u^{\prime}\in V_{G}$. If, for each vertex in the path, the vertex label is used instead of its ID, the path is called a labeled path. $\blacksquare$

A path without repetitive vertices is often referred to as a simple path. A cycle is a special path with at least three edges, in which the first and last vertices are identical, but otherwise all vertices are distinct.

Definition 2.2.

Graph Isomorphism. Given two graphs $H=(V_{H},E_{H},l_{H})$ and $G=(V_{G},E_{G},l_{G})$, a graph isomorphism from $H$ to $G$ is a bijection $f:V_{H}\longmapsto V_{G}$ such that:

1. $(u,v)\in E_{H}$ iff $(f(u),f(v))\in E_{G}$,

2. $l_{H}(u)=l_{G}(f(u))$ $\forall u\in V_{H}$, and

3. $l_{H}((u,v))=l_{G}((f(u),f(v)))$. $\blacksquare$

In other words, the isomorphism $f$ preserves the edge adjacencies, as well as the vertex and edge labels. If the function $f$ is only injective but not bijective, we say that $H$ is isomorphic to a subgraph of $G$, or subgraph isomorphic to $G$, denoted $H\subseteq G$. In this case we also say that $G$ contains $H$.

A graph automorphism is an isomorphism from a graph to itself. The set of all automorphisms of a graph $G$ forms its automorphism group. The graph $G$ may also contain many occurrences (embeddings) of the subgraph $H$. Two embeddings are considered redundant if their corresponding subgraphs are automorphic.

Example 1

In Figure 1, $G_{1}$ and $G_{2}$ are isomorphic graphs. An example of an isomorphism is $f(v_{1})=u_{1}$, $f(v_{2})=u_{2}$, $f(v_{3})=u_{3}$, and $f(v_{4})=u_{4}$. In Figure 2, $q$ is subgraph isomorphic to $G$. An example of a subgraph isomorphism is $f(u_{1})=v_{1}$, $f(u_{2})=v_{3}$, and $f(u_{3})=v_{4}$. There may be several graph or subgraph isomorphisms between two graphs. The set of all graph isomorphisms from $G_{1}$ to $G_{2}$ is shown in Figure 3(a), and the set of all subgraph isomorphisms from $q$ to $G$ is shown in Figure 3(b). The subgraphs identified by the two mappings $f_{1}$ and $f_{2}$ are redundant, and so are those identified by $f_{3}$ and $f_{4}$. $\blacksquare$

3 Related Work

A straightforward approach to checking subgraph isomorphism of a query graph $q$ against a data graph $G$ is to explore a tree-structured search space that considers all possible vertex-to-vertex correspondences from $q$ to $G$. The traversal of a branch is halted as soon as the structure of $q$ implied by the partial vertex mapping has no counterpart in $G$. Reaching a leaf node of the search space means that all vertices of $q$ have been mapped onto $G$ without violating the structure and label constraints of subgraph isomorphism, which is equivalent to having found a matching of $q$ in $G$.

The tree in Fig. 4 shows a part of the search space generated when testing the two graphs $q$ and $G$ in Fig. 2 for subgraph isomorphism. This space enumerates all possible mappings between the vertices of the two graphs. At level $i$ of the tree, a vertex $u_{i}$ in $V_{q}$ is mapped to some vertex in $V_{G}$ (the number $j$ inside a node of the search tree means that the node represents the vertex $v_{j}\in V_{G}$). The root node of the search tree represents the starting point of the search, inner nodes correspond to partial mappings, and nodes at level $|V_{q}|$ represent complete, though not necessarily sub-isomorphic, mappings. If there exists a complete mapping that preserves adjacency in both $q$ and $G$, then $q$ is subgraph isomorphic to $G$; otherwise it is not. The bold path in the tree ($u_{1}$ is mapped to $v_{1}$, $u_{2}$ to $v_{3}$, and $u_{3}$ to $v_{4}$) is a complete mapping that preserves adjacency in $q$ and $G$; thus $q$ is subgraph isomorphic to $G$.

Definition 3.1.

Matching Candidate Set. Given a vertex $u\in V_{q}$, the matching candidate set of $u$ is the set $Cand(u)$ of vertices in $G$ sharing the same vertex label with $u$, i.e., $Cand(u)$ = $\{v\in V_{G}:l_{q}(u)=l_{G}(v)\}$. $\blacksquare$

Thus, in the naive approach, for each vertex $u\in V_{q}$, an exhaustive search of possible one-to-one correspondences to $v\in Cand(u)$ is required. Therefore, the size of the total search space of the naive algorithm is $\prod_{i=1}^{N}|Cand(u_{i})|$, where $N=|V_{q}|$. The worst-case time complexity of the algorithm is $O(M^{N})$, where $M=|V_{G}|$. This exponential bound is consistent with subgraph isomorphism being NP-complete. In practice, the running time depends tightly on the size of the search space.
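As a rough illustration of Definition 3.1 and the search-space size, the following Python sketch builds $Cand(u)$ by label only and multiplies the set sizes. The vertex labels are an assumption reconstructed from Examples 2 and 3 (Figure 2 itself is not reproduced here), and the function names are hypothetical.

```python
from math import prod


def candidate_sets(query_labels, data_labels):
    """Cand(u): data vertices carrying the same label as query vertex u."""
    return {
        u: {v for v, label in data_labels.items() if label == q_label}
        for u, q_label in query_labels.items()
    }


def search_space_size(cands):
    """Product of |Cand(u_i)| over all query vertices."""
    return prod(len(c) for c in cands.values())


# Labels assumed from Examples 2 and 3: u1, v1, v2 carry A; the rest carry B.
query_labels = {"u1": "A", "u2": "B", "u3": "B"}
data_labels = {"v1": "A", "v2": "A", "v3": "B", "v4": "B", "v5": "B"}

cands = candidate_sets(query_labels, data_labels)
size = search_space_size(cands)
print(size)
```

With these assumed labels the naive space has 2 x 3 x 3 = 18 complete mappings to test, before any structural pruning.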

3.1 Ullman Algorithm

Ullman algorithm [Ullmann (1976)] is the earliest and a highly cited approach to the subgraph isomorphism problem. Given a query graph $q$ and a data graph $G$, Ullman’s basic approach to checking whether $q$ is a subgraph of $G$ is to enumerate all possible mappings of vertices in $V_{q}$ to those in $V_{G}$ using a depth-first tree search. To cope with the subgraph isomorphism problem efficiently, Ullman proposed a refinement procedure to prune the search space. It is based on the following three conditions:

1. Label and degree condition. A vertex $u\in V_{q}$ can be mapped to $v\in V_{G}$ under injective mapping $f$, i.e., $v=f(u)$, if

(i) $l_{q}(u)$ = $l_{G}(v)$, and

(ii) $deg_{q}(u)\leq deg_{G}(v)$.

2. One-to-One mapping of vertices condition. Once a vertex $u\in V_{q}$ is mapped to $v\in V_{G}$, we cannot map any other vertex in $V_{q}$ to the vertex $v$.

3. Neighbor condition. By this condition Ullman algorithm examines the feasibility of mapping $u\in V_{q}$ to $v\in V_{G}$ by considering the preservation of structural connectivity. If there exist edges connecting $u$ with previously explored vertices of $q$ but there are no counterpart edges in $G$, the mapping test simply fails.

Applying the above three conditions, $|Cand(u)|$ for each $u\in V_{q}$ could be decreased; thus cutting down the search space.
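The three conditions can be sketched as a small depth-first search in Python. This is a minimal illustration, not Ullmann's original matrix-based refinement procedure: the function names are hypothetical, and the two graphs are an assumption reconstructed from the labeled neighborhoods in Example 3 and the mappings of Example 1, since Figure 2 is not reproduced here.

```python
def build_adj(edges):
    """Undirected adjacency map: vertex -> {neighbor: edge label}."""
    adj = {}
    for (a, b), label in edges.items():
        adj.setdefault(a, {})[b] = label
        adj.setdefault(b, {})[a] = label
    return adj


def enumerate_mappings(q_labels, q_edges, g_labels, g_edges):
    """Depth-first enumeration of injective, edge-preserving mappings q -> G."""
    q_adj, g_adj = build_adj(q_edges), build_adj(g_edges)
    order = list(q_labels)  # query vertices in input order
    # Condition 1: same label, and query degree <= data degree.
    cand = {
        u: [v for v in g_labels
            if g_labels[v] == q_labels[u]
            and len(q_adj.get(u, {})) <= len(g_adj.get(v, {}))]
        for u in order
    }
    mapping, results = {}, []

    def neighbors_preserved(u, v):
        # Condition 3: every edge from u to an already mapped query vertex
        # needs a counterpart edge with the same label in G.
        for w, label in q_adj.get(u, {}).items():
            if w in mapping and g_adj.get(v, {}).get(mapping[w]) != label:
                return False
        return True

    def extend(i):
        if i == len(order):
            results.append(tuple(mapping[u] for u in order))
            return
        u = order[i]
        for v in cand[u]:
            # Condition 2: a data vertex is used by at most one query vertex.
            if v in mapping.values() or not neighbors_preserved(u, v):
                continue
            mapping[u] = v
            extend(i + 1)
            del mapping[u]  # backtrack

    extend(0)
    return results


# Graphs assumed from Examples 1 to 3 (query triangle q, data graph G).
Q_LABELS = {"u1": "A", "u2": "B", "u3": "B"}
Q_EDGES = {("u1", "u2"): "Y", ("u1", "u3"): "Y", ("u2", "u3"): "Z"}
G_LABELS = {"v1": "A", "v2": "A", "v3": "B", "v4": "B", "v5": "B"}
G_EDGES = {("v1", "v2"): "X", ("v1", "v3"): "Y", ("v1", "v4"): "Y",
           ("v2", "v4"): "Y", ("v2", "v5"): "Y", ("v3", "v4"): "Z",
           ("v4", "v5"): "Z"}

mappings = enumerate_mappings(Q_LABELS, Q_EDGES, G_LABELS, G_EDGES)
print(mappings)
```

On the assumed graphs the search returns four complete mappings, matching the four subgraph isomorphisms $f_{1}$ to $f_{4}$ listed for $q$ and $G$.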

3.2 QuickSI Algorithm

QuickSI [Shang et al. (2008)] is a more recent subgraph isomorphism checking algorithm. It is based on Ullman and improves upon it by speeding up Ullman’s search. The underlying observation is that Ullman’s search order is arbitrary: Ullman usually matches query vertices in the input order. Some orderings do not preserve connectivity between consecutive query vertices, which forces Ullman to spend a lot of time checking the feasibility of partial mappings. Instead of trivially enumerating mappings according to the given order of $V_{q}$, QuickSI enumerates mappings from a spanning tree of $q$ to $G$, so that the connectivity restriction reduces the number of combinations.

QuickSI proposes to follow a search order given by the $QI$-$Sequence$. The $QI$-$Sequence$ represents a rooted spanning tree, $t_{q}$, of $q$ and consists of a list of spanning entries, $T_{i}$, for $1\leq i\leq|V_{q}|$, where each $T_{i}$ keeps the basic information of the spanning tree of $q$. In the $QI$-$Sequence$, a $T_{i}$ may be followed by a list of extra entries, $R_{ij}$, which keep the extra topology information related to the corresponding spanning entry. To identify a subgraph isomorphic mapping from $q$ to $G$, QuickSI iteratively grows each possible mapping on $t_{q}$ in a depth-first manner according to the vertex order in the $QI$-$Sequence$. QuickSI can terminate early if a prefix of the $QI$-$Sequence$ cannot be sub-isomorphically mapped to $G$. To effectively reduce the search cost, the authors propose to order the $QI$-$Sequence$ as follows. Pick the vertex $v$ of $q$ whose label has the lowest occurrence in the graph $G$ as the first entry in the $QI$-$Sequence$. Then, iteratively pick an unchosen vertex such that the spanning edge has the lowest occurrence in $G$ among all valid options.

3.3 Vflib Algorithm

The Vflib algorithm [Cordella et al. (2004)] is another important algorithm for the subgraph isomorphism problem. It uses a different strategy from Ullman algorithm. Vflib proceeds by creating and modifying a match state. The match state contains a matched-set, which is a set of vertex pairs that match between the query graph $q$ and the data graph $G$. If the matched-set covers all vertices of the query graph $q$, then the algorithm is successful and returns. Otherwise, the algorithm attempts to add a new pair. It does this by tracking, in each graph, the set of vertices immediately adjacent to the matched-set. These sets define the potential vertices that can be added to a given state: the only pairs that can be added are those drawn from the adjacent sets of both graphs. The algorithm uses backtracking search either to find a successful match state or to return a failure.

4 New Subgraph Isomorphism Algorithms

Clearly, subgraph isomorphism checking is very costly, and it becomes even more challenging when the graph and the query are large and dense. To alleviate the time-consuming search performed by previous algorithms, we consider reducing the search space size $\prod_{i=1}^{N}|Cand(u_{i})|$ in the following two ways:

• Minimize $|Cand(u)|$ for each vertex $u\in V_{q}$.

• Minimize the number of one-to-one correspondence checks, i.e., minimize $N$.

In this paper, we propose two new algorithms for subgraph isomorphism checking. Both are based on Ullman algorithm and improve upon it by reducing its search space. The first algorithm reduces the search space by utilizing the label information of each vertex’s neighborhood, and speeds up the search by following a novel ordering strategy for the query’s vertices. The algorithm is called Fast-ON (the name comes from: Fast subgraph testing by Ordering the query’s vertices and utilizing labeled Neighborhood information). Compared with the well-known algorithms Ullman [Ullmann (1976)] and Vflib [Cordella et al. (2004)], Fast-ON achieves speed-ups of one to three orders of magnitude. The Fast-ON algorithm was published in [Gouda and Hassaan (2012)].

The second algorithm explores the possibility of leveraging substructure matching instead of vertex matching to minimize $N$. Substructure matching cuts down the depth of the search tree and, consequently, the size of the search space, since the sets of matching candidates also shrink. This new algorithm follows a path-at-a-time matching manner and is called Fast-P (from: Fast Path-at-a-time manner). To speed up the search in Fast-P, we propose an ordering of the query paths that forces false mappings to be discarded as early as possible during the search. Fast-P is discussed in detail in Section 4.2. Next, we introduce the Fast-ON algorithm.

4.1 Fast-ON Algorithm

The search space considered by Ullman algorithm is still huge even after applying the refinement procedure. Fast-ON explores a much smaller space than Ullman algorithm by utilizing vertex neighborhoods, as in the following optimization.

4.1.1 Opt1: Utilizing Neighborhood Labels

Here, we introduce a condition effective in reducing the search space. It is based on the neighborhood labels of matching vertices. This new condition is much stronger than the label and degree condition of the refinement procedure in Ullman algorithm. First, we define the labeled neighborhood of any vertex as follows.

Definition 4.1.

Vertex Labeled Neighborhood. Given a graph $G$ and a vertex $u\in V_{G}$, the labeled neighborhood of $u$ is given as $NL_{G}(u)$ = $\{(l_{G}(v),l_{G}((u,v))):v\in V_{G}$ and $(u,v)\in E_{G}\}$. $\blacksquare$

The following theorem presents the necessary condition required to map a vertex $u\in V_{q}$ to a vertex $v\in V_{G}$ .

Theorem 4.2.

Given two graphs $q$ and $G$ such that $q$ is subgraph isomorphic to $G$ under an injective function $f$. If $u\in V_{q}$ is mapped to $v\in V_{G}$, then $NL_{q}(u)\subseteq NL_{G}(v)$. $\blacksquare$

Thus, according to Theorem 4.2, if the labeled neighborhood of a vertex $v\in V_{G}$ does not contain the labeled neighborhood of a vertex $u\in V_{q}$, then $u$ cannot be mapped to $v$. We can reduce the search space by enforcing this inclusion test. The next condition strengthens the first condition of the refinement procedure in Ullman algorithm by adding this inclusion test.

Label and neighborhood inclusion condition. A vertex $u\in V_{q}$ can be mapped to $v\in V_{G}$ under injective function $f$, i.e., $v=f(u)$, if

(i) $l_{q}(u)$ = $l_{G}(v)$ , and

(ii) $NL_{q}(u)\subseteq NL_{G}(v)$ .

Note that if $NL_{q}(u)\subseteq NL_{G}(v)$ is satisfied, it directly leads to $deg(u)\leq deg(v)$ since $deg(v)=|NL_{G}(v)|$ for simple graphs.

Example 2

Consider the two graphs $q$ and $G$ given in Figure 2. According to the label and neighborhood inclusion condition, we can map vertex $u_{1}\in V_{q}$ to $v_{1}\in V_{G}$ since (i) $l_{q}(u_{1})$ = $l_{{G}}(v_{1})=A$ , and (ii) $NL_{q}(u_{1})=\{(B,Y),(B,Y)\}\subseteq\{(A,X),(B,Y),(B,Y)\}=NL_{G}(v_{1})$ . $\blacksquare$
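The inclusion test of Example 2 can be sketched in Python by treating a labeled neighborhood as a multiset, since the example lists $(B,Y)$ twice. Using `collections.Counter` for the multiset is a choice made for this sketch, the function names are hypothetical, and the edge lists are an assumption reconstructed from Example 3 because Figure 2 is not reproduced here.

```python
from collections import Counter


def labeled_neighborhood(vertex, labels, edges):
    """Multiset of (neighbor label, edge label) pairs around `vertex`."""
    nl = Counter()
    for (a, b), edge_label in edges.items():
        if a == vertex:
            nl[(labels[b], edge_label)] += 1
        elif b == vertex:
            nl[(labels[a], edge_label)] += 1
    return nl


def nl_included(nl_q, nl_g):
    """Multiset inclusion: each pair occurs in nl_g at least as often."""
    return all(nl_g[pair] >= count for pair, count in nl_q.items())


# Graphs assumed from Example 3.
Q_LABELS = {"u1": "A", "u2": "B", "u3": "B"}
Q_EDGES = {("u1", "u2"): "Y", ("u1", "u3"): "Y", ("u2", "u3"): "Z"}
G_LABELS = {"v1": "A", "v2": "A", "v3": "B", "v4": "B", "v5": "B"}
G_EDGES = {("v1", "v2"): "X", ("v1", "v3"): "Y", ("v1", "v4"): "Y",
           ("v2", "v4"): "Y", ("v2", "v5"): "Y", ("v3", "v4"): "Z",
           ("v4", "v5"): "Z"}

nl_u1 = labeled_neighborhood("u1", Q_LABELS, Q_EDGES)
nl_v1 = labeled_neighborhood("v1", G_LABELS, G_EDGES)
nl_v3 = labeled_neighborhood("v3", G_LABELS, G_EDGES)

print(nl_included(nl_u1, nl_v1))  # u1 may be mapped to v1
print(nl_included(nl_u1, nl_v3))  # u1 may not be mapped to v3
```

A plain set would lose the multiplicity of $(B,Y)$, so the multiset view is what keeps the degree bound noted above implied by the inclusion test.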

Though the label and neighborhood inclusion condition is effective in reducing the search space, applying the inclusion test is expensive especially for large size graphs with higher average vertex degree. Below, we propose a new method to efficiently apply the inclusion test. The method is based on the observation that many vertices in the query or data graph share the same neighborhood. The next example highlights this fact.

Example 3

Consider the query graph $q$ and data graph $G$ given in Figure 2. We have (1) In graph $G$ : $NL_{G}(v_{1})$ = $NL_{G}(v_{2})$ = $\{(A,X),(B,Y),(B,Y)\}$ , $NL_{G}(v_{3})$ = $NL_{G}(v_{5})$ = $\{(A,Y),(B,Z)\}$ , and $NL_{G}(v_{4})$ = $\{(A,Y),(A,Y),(B,Z),(B,Z)\}$ ; (2) In query graph $q$ : $NL_{q}(u_{1})$ = $\{(B,Y),(B,Y)\}$ , and $NL_{q}(u_{2})$ = $NL_{q}(u_{3})$ = $\{(A,Y),(B,Z)\}$ . $\blacksquare$

Based on the above observation, we can reduce the cost of the containment checks by caching most of the repeated computations, as in the following steps:

1. Find the sets of distinct labeled neighborhoods of the two graphs $q$ and $G$, denoted as $DLN_{q}$ and $DLN_{G}$, respectively.

2. Construct a bit matrix $M_{DLN}=(m_{ij})_{\alpha\beta}$, where $\alpha=|DLN_{q}|$ and $\beta=|DLN_{G}|$, to maintain the inclusion relationship between distinct neighborhoods of $q$ and $G$, that is, $m_{ij}=1$ if $DLN_{q}[i]\subseteq DLN_{G}[j]$, otherwise $m_{ij}=0$.

3. For a graph $g$, where $g$ is $q$ or $G$, construct an array of pointers $P_{g}$ of size $|V_{g}|$, called the position array, where each slot $u$ holds the index of the labeled neighborhood of vertex $u$ in $DLN_{g}$.

Now we can say that, for each $u\in V_{q}$ and $v\in V_{G}$ , we have $NL_{q}(u)\subseteq NL_{G}(v)$ iff $m_{{P_{q}(u)}{P_{G}(v)}}=1$ . Thus, the test (ii) in label and neighborhood inclusion condition can be replaced by testing if $m_{{P_{q}(u)}{P_{G}(v)}}=1$ .
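The three caching steps can be sketched as follows. This is a minimal Python illustration under stated assumptions: the labeled neighborhoods are copied from Example 3 and stored as multisets, the bit matrix is a plain list of lists, and all function and variable names are hypothetical.

```python
from collections import Counter


def distinct_neighborhoods(nls):
    """Step 1 and step 3: return (DLN list, position array as a dict)."""
    dln, index_of, position = [], {}, {}
    for vertex, nl in nls.items():
        key = tuple(sorted(nl.items()))  # hashable form of the multiset
        if key not in index_of:
            index_of[key] = len(dln)
            dln.append(nl)
        position[vertex] = index_of[key]
    return dln, position


def inclusion_matrix(dln_q, dln_g):
    """Step 2: m[i][j] = 1 iff DLN_q[i] is included in DLN_G[j]."""
    return [
        [int(all(ng[pair] >= n for pair, n in nq.items())) for ng in dln_g]
        for nq in dln_q
    ]


# Labeled neighborhoods as listed in Example 3.
NL_Q = {
    "u1": Counter({("B", "Y"): 2}),
    "u2": Counter({("A", "Y"): 1, ("B", "Z"): 1}),
    "u3": Counter({("A", "Y"): 1, ("B", "Z"): 1}),
}
NL_G = {
    "v1": Counter({("A", "X"): 1, ("B", "Y"): 2}),
    "v2": Counter({("A", "X"): 1, ("B", "Y"): 2}),
    "v3": Counter({("A", "Y"): 1, ("B", "Z"): 1}),
    "v4": Counter({("A", "Y"): 2, ("B", "Z"): 2}),
    "v5": Counter({("A", "Y"): 1, ("B", "Z"): 1}),
}

dln_q, pos_q = distinct_neighborhoods(NL_Q)
dln_g, pos_g = distinct_neighborhoods(NL_G)
m_dln = inclusion_matrix(dln_q, dln_g)


def may_map(u, v):
    """Constant-time replacement for the inclusion test (ii)."""
    return m_dln[pos_q[u]][pos_g[v]] == 1


print(m_dln)
print(may_map("u2", "v4"))
```

Here only 2 x 3 inclusion tests are computed instead of 3 x 5, and every later check is a matrix lookup.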

In subgraph search problem, for example, caching the repeated computations as above is very useful since real graph data tend to share commonality, that is, a vertex may appear in many data graphs. This happens because the real data come from the same application domain. Note that in the experiments, subgraph search problem is used for testing Fast-ON algorithm.

To speed up the search in Fast-ON, we propose an ordering methodology for the query vertices, as shown in the following optimization.

4.1.2 Opt2: Ordering the query vertices

This optimization is based on the observation that the search order in Ullman algorithm is arbitrary: it depends on the order in which the query vertices are given in the input. This default ordering of $V_{q}$ can result in a search order that seriously slows down Ullman algorithm. Query vertices should be explored in an order that gets the most benefit from the third condition. Unlike the QuickSI algorithm, our approach to ordering $V_{q}$ requires the query vertex currently being processed to have high connectivity with the previously explored ones. That is, if $u_{i}\in V_{q}$ is the vertex currently being processed, then $u_{i}$ should have the highest connectivity with $u_{1},u_{2},\ldots,u_{i-1}$ among the remaining vertices. The first vertex to explore, $u_{1}$, is the one with maximum degree. This ordering forces false mappings to be discarded as early as possible during the search, thus saving much of the time that Ullman algorithm may spend on long false partial mappings. Figure 5 outlines this idea.
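A greedy version of this ordering can be sketched in Python as below. The text does not say how ties are broken, so breaking ties by higher degree and then by vertex name is an assumption of this sketch, as are the function name and the small example query.

```python
def order_query_vertices(adj):
    """Order vertices so each one is highly connected to those before it.

    adj maps each query vertex to the set of its neighbors.
    """
    vertices = sorted(adj)  # fixed scan order makes ties deterministic
    # The first vertex is one with maximum degree.
    first = max(vertices, key=lambda u: len(adj[u]))
    order = [first]
    remaining = [u for u in vertices if u != first]
    while remaining:
        chosen = set(order)
        # Prefer the vertex with the most edges into the chosen prefix;
        # ties fall back to degree (an assumption, not from the paper).
        best = max(remaining,
                   key=lambda u: (len(adj[u] & chosen), len(adj[u])))
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical query: triangle b-c-d with pendant vertices a and e.
QUERY_ADJ = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

vertex_order = order_query_vertices(QUERY_ADJ)
print(vertex_order)
```

In this query the triangle vertices are placed first, so the neighbor condition can reject a bad partial mapping after at most three levels.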

4.1.3 Fast-ON Pseudocode

Figure 6 outlines the Fast-ON algorithm. Line 1 applies the second optimization Opt2, whereas lines 2-5 outline the first optimization Opt1. In line 5, for each query vertex $u\in V_{q}$, the data graph vertices $v\in V_{G}$ that satisfy the modified first condition are collected into a set called the candidate set $Cand(u)$. The procedure $Recursive\_Search$ matches $u_{i}$ over $Cand(u_{i})$ (line 5) and proceeds step by step by recursively matching the subsequent vertex $u_{i+1}$ over $Cand(u_{i+1})$ (lines 6-7), or sets the Test variable to true and returns if every vertex of $q$ has a counterpart in $G$ (line 9). If $u_{i}$ exhausts all vertices in $Cand(u_{i})$ and still cannot find a match, Recursive_Search backtracks to the previous state for further exploration (line 11). The procedure Matchable applies the third condition.

Note that according to Opt1, for each $u$, $Cand(u)$ is as small as possible. Consequently, Fast-ON explores a much smaller space than Ullman algorithm. Moreover, according to Opt2, false mappings are discarded as early as possible, saving much of the computation spent by Ullman algorithm.

4.2 Fast-P Algorithm

The vertex-to-vertex matching used in Ullman and Fast-ON is time consuming, especially when $N=|V_{q}|$ is large. Recall that $N$ represents the depth of the search tree. In this section, we propose a new algorithm for the subgraph isomorphism problem that uses substructure correspondences instead of vertex correspondences to reduce the depth of the search tree. Intuitively, if we index a set of substructures of the data graph $G$, $S=\{s_{1},s_{2},\ldots\}$, such that $s_{i}\subset G$, and answer subgraph isomorphism in a structure-at-a-time manner by checking one-to-one correspondences on the query’s substructures instead of the query’s vertices, we reduce the depth of the search space. In other words, we can reduce the depth of the search tree of Ullman algorithm by matching one substructure per iteration. Applying this idea raises two challenges:

• The First Challenge. Which kind of substructure will work efficiently?

• The Second Challenge. How are these substructures extracted and used?

Regarding the first challenge, there are three kinds of substructures that can be indexed: paths, trees, and graphs. We use paths for the following reasons:

1. Enumerating paths in a given graph $G$ is simple, while enumerating general subgraphs, or even trees, is quite expensive.

2. Manipulating paths is much easier than manipulating general subgraphs. For instance, the number of redundancies of every path’s embedding is at most two, while it could be much larger than two for general subgraphs, which adds extra overhead in the general case. The main cause of redundancy is discussed in more detail below.

The new algorithm, called Fast-P (Fast Path-at-a-time manner algorithm), explores a tree-structured search space considering all possible path-to-path mappings from $q$ to $G$ . Each path corresponding to a query path is, in fact, a local match to its corresponding query path. If the query is subgraph isomorphic to the data graph, then some of these local matches could be combined together to produce a global match to the query. In what follows, we show how paths are extracted and efficiently used in Fast-P (the second challenge).

4.2.1 Path Enumeration and Encoding in Fast-P

Since the strategy of Fast-P is based on path-to-path matching, we first enumerate and index simple paths in the data graph $G$. Usually, the number of paths in $G$ is large. Thus, we use a path size parameter, called $maxL$, to control the number of indexed paths in $G$. We use ${\cal P}_{G}$ to denote the set of simple paths of size up to $maxL$ in a graph $G$. To deal with redundancy during path enumeration, we introduce the following concepts.

Definition 4.3.

Reversed Path. Given a path $p=v_{1}\rightsquigarrow v_{k}$ in a graph $G$, its reversed path is the path $v_{k}\rightsquigarrow v_{1}$, denoted by $p^{\tt r}$. $\blacksquare$

Definition 4.4.

(Non-)Iso Path. A path $v_{1}\rightsquigarrow v_{k}$ in a graph $G$ is called an iso path if $l(v_{i})=l(v_{k-i+1})$ and $l((v_{i},v_{i+1}))=l((v_{k-i},v_{k-i+1}))$ $\forall$ $i=1,2,\ldots,k/2$; otherwise it is called a non-iso path. $\blacksquare$

Example 4

The path $p_{1}=(v_{1},v_{2},v_{3})$ in Figure 7 is an iso path since $l(v_{1})=l(v_{3})=B$ and $l((v_{1},v_{2}))=l((v_{2},v_{3}))=Z$, while the path $p_{2}$ is a non-iso path since $l(v_{1})=A\neq B=l(v_{3})$. Finally, $p_{1}^{r}=(v_{3},v_{2},v_{1})$. $\blacksquare$
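Definition 4.4 amounts to checking that the label sequence of a path reads the same in both directions. A minimal Python sketch, with hypothetical names, could look like this. The middle vertex label of $p_{1}$ is assumed to be $B$, based on its embedding $\{v_{3},v_{4},v_{5}\}$ in Example 5, and the labels of $p_{2}$ are taken from its code in Example 6.

```python
def is_iso_path(vertex_labels, edge_labels):
    """True if vertex and edge label sequences are both palindromes."""
    return (vertex_labels == vertex_labels[::-1]
            and edge_labels == edge_labels[::-1])


# p1: B -Z- B -Z- B (middle label assumed), p2: A -X- A -Y- B.
P1 = (["B", "B", "B"], ["Z", "Z"])
P2 = (["A", "A", "B"], ["X", "Y"])

print(is_iso_path(*P1))
print(is_iso_path(*P2))
```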

Lemma 4.5.

Every embedding of an iso path $p$ has two redundancies, $p$ and $p^{r}$ .

Example 5

Consider the two paths $p_{1}$ and $p_{2}$, the tree $T_{1}$, and the two graphs $G_{1}$ and $G$ in Figure 7. The iso path $p_{1}$ has two redundant embeddings in $G$, namely $\{v_{3},v_{4},v_{5}\}$ and $\{v_{5},v_{4},v_{3}\}$, while the non-iso path $p_{2}$ has only one embedding, namely $\{v_{2},v_{1},v_{4}\}$. The tree $T_{1}$ and the subgraph $G_{1}$ have four redundant embeddings in $G$, which are $\{v_{1},v_{2},v_{3},v_{4},v_{5}\}$, $\{v_{2},v_{1},v_{3},v_{4},v_{5}\}$, $\{v_{1},v_{2},v_{5},v_{4},v_{3}\}$, and $\{v_{2},v_{1},v_{5},v_{4},v_{3}\}$. $\blacksquare$

Storing and comparing paths would require a good representation of path embeddings. To do so, consider the following concepts.

Definition 4.6.

Canonical Path. The code of a path $p=v_{1}\rightsquigarrow v_{k}$, denoted as $code(p)$, is a sequence of vertex and edge labels in the following order: $"l(v_{1})l((v_{1},v_{2}))l(v_{2})\ldots l(v_{k-1})l((v_{k-1},v_{k}))l(v_{k})"$. The path $p$ is called canonical, denoted $p^{c}$, if its code is the lexicographic minimum of $code(p)$ and $code(p^{r})$. $\blacksquare$

Corollary 4.7.

Every iso path $p$ is canonical.

Proof. This is because $code(p)=code(p^{r})$. $\blacksquare$

Example 6

Consider the two paths $p_{1}$ and $p_{2}$ in Figure 7. The path $p_{1}$ is canonical since it is an iso path, and the path $p_{2}$ is canonical since $code(p_{2})$ = $"AXAYB"\leq"BYAXA"=code(p_{2}^{r})$. $\blacksquare$
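The code and canonical form of Definition 4.6 can be sketched in Python as follows. The function names are hypothetical, and plain string concatenation is only safe here because every label is a single character; with longer labels a separator would be needed, which is an assumption outside the text.

```python
def path_code(vertex_labels, edge_labels):
    """Interleave vertex and edge labels: l(v1) l(e1) l(v2) ... l(vk)."""
    parts = [vertex_labels[0]]
    for edge_label, vertex_label in zip(edge_labels, vertex_labels[1:]):
        parts.append(edge_label)
        parts.append(vertex_label)
    return "".join(parts)


def canonical_code(vertex_labels, edge_labels):
    """Lexicographic minimum of code(p) and code(p^r)."""
    forward = path_code(vertex_labels, edge_labels)
    backward = path_code(vertex_labels[::-1], edge_labels[::-1])
    return min(forward, backward)


# p2 of Example 6 and its reversed path.
P2_VERTICES, P2_EDGES = ["A", "A", "B"], ["X", "Y"]

code_p2 = path_code(P2_VERTICES, P2_EDGES)
code_p2_reversed = path_code(P2_VERTICES[::-1], P2_EDGES[::-1])
print(code_p2, code_p2_reversed, canonical_code(P2_VERTICES, P2_EDGES))
```

Because both orientations of a path share one canonical code, that code can serve as a lookup key when paths of the data graph are indexed.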

4.2.2 Path Matching in Fast-P

Usually, the number of paths in a query that are candidates for matching is much larger than the number of vertices, i.e., $|{\cal P}_{q}|\gg|V_{q}|$ . Thus, for Fast-P to be effective, the number of query’s paths used for matching should be less than the number of query’s vertices. Considering disjoint paths of size up to $maxL$ , denoted as $DP_{maxL}(q)$ , which cover the query, is a key step toward reaching this objective. Disjoint paths are defined as follows.

Definition 4.8.

Disjoint Paths. Distinct paths in a graph $q$ are called disjoint if they are edge disjoint, but not necessarily node disjoint. $\blacksquare$

Example 7

Suppose that the graph $G$ in Figure 7 is our query $q$ , and set $maxL=2$ . There are 21 paths in ${\cal P}_{q}$ , namely ${\cal P}_{q}=\{\{1,2,4\},\{1,2,5\},\{1,3,4\},\{1,4,2\},\{1,4,3\},\{1,4,5\},\{2,1,3\},\{2,1,4\},\{2,4,3\},\{2,4,5\},\{2,5,4\},\{3,1,4\},\{3,4,5\},\{4,2,5\},\{1,2\},\{1,3\},\{1,4\},\{2,4\},\{2,5\},\{3,4\},\{4,5\}\}$ . The following disjoint paths cover $q$ : $DP_{2}(q)=\{\{5,2,4\},\{2,1,4\},\{3,4,5\},\{1,3\}\}$ . Comparing $|DP_{2}(q)|=4$ with $|V_{q}|=5$ (the query has $|E_{q}|=7$ edges), Fast-P needs one call fewer than vertex-at-a-time matching. $\blacksquare$
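The path count in Example 7 can be reproduced with a short enumeration sketch. The edge list below is read off the seven 1-edge paths listed in the example; the function name `simple_paths` is hypothetical, and each undirected path is stored once by keeping the smaller of its two orientations.

```python
def simple_paths(edges, max_l):
    """Enumerate undirected simple paths with 1..max_l edges, one orientation each."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    found = set()

    def extend(path):
        if len(path) > 1:
            # Keep a single representative for p and its reverse p^r.
            found.add(min(path, path[::-1]))
        if len(path) - 1 == max_l:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple path: no repeated vertex
                extend(path + (nxt,))

    for start in adj:
        extend((start,))
    return found


# Edges of the query in Example 7 (taken from its listed 1-edge paths).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 4), (2, 5), (3, 4), (4, 5)]
paths = simple_paths(EDGES, 2)
```

Under these assumptions the set contains 14 two-edge paths and 7 one-edge paths, matching the 21 paths listed in the example.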

Thus, the total search space of Fast-P is given by the product $\prod^{|DP_{maxL}(q)|}_{i=1}|Cand(p_{i})|$ , where $Cand(p_{i})=\{p^{\prime}\in{\cal P}_{G}:code(p^{\prime^{c}})=code(p_{i}^{c})\}$ is the set of graph paths that match a query path $p_{i}$ . To optimize Fast-P, query paths should be chosen such that $|DP_{maxL}(q)|$ and $|Cand(p_{i})|$ are minimized. The first optimization, called Opt1, minimizes $|DP_{maxL}(q)|$ . The second, called Opt2, minimizes the number of matching candidates $|Cand(p_{i})|$ for each query path $p_{i}$ . Finally, to speed up the search in Fast-P, the third optimization, called Opt3, orders the query paths so that false mappings are discarded as early as possible during the search.
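To make the candidate sets and the search-space product concrete, the following sketch groups data graph paths by canonical code and multiplies the candidate counts of the query paths. It is an illustration under assumptions: each path is given directly as its alternating vertex/edge label sequence, the path identifiers `g1` to `g4` are made up, and the helper names are hypothetical.

```python
from collections import defaultdict
from math import prod


def canonical_code(labels):
    """labels: alternating vertex/edge labels along a path, as a tuple."""
    return min(labels, labels[::-1])


def index_by_code(data_paths):
    """Map each canonical code to the data graph paths that carry it."""
    index = defaultdict(list)
    for path_id, labels in data_paths.items():
        index[canonical_code(labels)].append(path_id)
    return index


def search_space(query_codes, index):
    """Product of |Cand(p_i)| over the query paths."""
    return prod(len(index.get(canonical_code(c), [])) for c in query_codes)


# Hypothetical data graph paths, identified by made-up ids.
data_paths = {
    "g1": ("A", "X", "A", "Y", "B"),
    "g2": ("B", "Y", "A", "X", "A"),  # same canonical code as g1
    "g3": ("A", "X", "A"),
    "g4": ("A", "X", "A"),
}
index = index_by_code(data_paths)
space = search_space([("A", "X", "A", "Y", "B"), ("A", "X", "A")], index)
```

Here each of the two query paths has two candidates, so the product is 4; shrinking either factor or the number of factors shrinks the search space, which is what Opt1 and Opt2 aim at.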

4.2.3 Opt1: Minimizing $|DP_{maxL}(q)|$

For a given query graph, there are multiple disjoint path decompositions; some are compact and others are not. The algorithm in Figure 8 finds a compact set of disjoint paths that cover $q$ . It works as follows. Given the set ${\cal P}_{q}$ of all limited-size simple paths generated from the query $q$ , the paths are processed in descending order of size. For each encountered path $p\in{\cal P}_{q}$ , we check whether removing $p$ from the query disconnects it. If so, i.e., the resulting graph is disconnected, $p$ is skipped and the search continues with the next path. If, on the other hand, the resulting graph is still connected, $p$ is added to the cover and removed from the query. Theorem 4.9 shows that the selected paths $DP(q)$ are disjoint, and that $DP(q)$ is compact if $maxL=2$ .
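The greedy selection described above can be sketched as follows. This is not a transcription of Figure 8 but an illustration under stated assumptions: connectivity is checked only over vertices that still have an incident edge (so an empty remainder counts as connected), and the scan over ${\cal P}_{q}$ is repeated until every edge is covered, since a path that is skipped early may become removable later. All function names are hypothetical.

```python
def is_connected(edges):
    """True if the graph formed by `edges` is connected (empty counts as connected)."""
    edges = list(edges)
    if not edges:
        return True
    adj = {}
    for edge in edges:
        a, b = tuple(edge)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adj)


def path_edges(path):
    """Edges of a path given as a vertex tuple."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def greedy_cover(edges, paths):
    """Pick edge-disjoint paths, longest first, never disconnecting the remainder."""
    remaining = {frozenset(e) for e in edges}
    cover = []
    candidates = sorted(paths, key=len, reverse=True)  # stable: ties keep input order
    while remaining:
        progressed = False
        for p in candidates:
            pe = path_edges(p)
            if pe <= remaining and is_connected(remaining - pe):
                cover.append(p)
                remaining -= pe
                progressed = True
        if not progressed:  # safety stop; assumed not to trigger on connected input
            break
    return cover


# Query and path list of Example 7 (maxL = 2), in the order given there.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 4), (2, 5), (3, 4), (4, 5)]
PATHS = [(1, 2, 4), (1, 2, 5), (1, 3, 4), (1, 4, 2), (1, 4, 3), (1, 4, 5), (2, 1, 3),
         (2, 1, 4), (2, 4, 3), (2, 4, 5), (2, 5, 4), (3, 1, 4), (3, 4, 5), (4, 2, 5),
         (1, 2), (1, 3), (1, 4), (2, 4), (2, 5), (3, 4), (4, 5)]
cover = greedy_cover(EDGES, PATHS)
```

On this input the sketch returns four edge-disjoint paths that cover all seven edges. The particular paths depend on the tie-breaking order among paths of equal size, so they need not coincide with the cover shown in Example 7.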

Theorem 4.9.

Let ${\cal P}_{q}$ be the set of simple paths of $q$ of size up to $maxL$ edges. The set $DP_{maxL}(q)$ returned by the algorithm in Figure 8 is a set of disjoint paths covering $q$ . If $maxL=2$ , then $DP_{maxL}(q)$ is compact.

proof: A path of the largest length $p\in{\cal P}_{q}$ is inserted into $DP_{maxL}(q)$ (line 7) and removed from $q^{\prime}$ (line 6) if it fully exists in $q^{\prime}$ , i.e., if $p\subseteq q^{\prime}$ (line 5). This guarantees that all chosen paths do not share any edge, i.e., they are disjoint.

Suppose that $DP_{maxL}(q)$ is not compact and $maxL=2$ . Then, there exist at least two 1-edge paths $p_{1}$ and $p_{2}$ in $DP_{maxL}(q)$ such that the path $p=p_{1}\cup p_{2}$ is not chosen by the algorithm. Since $p_{1}\subseteq q^{\prime}$ and $p_{2}\subseteq q^{\prime}$ , then the only reason to not choose $p$ is that $p$ disconnects $q^{\prime}$ . On the other hand, since removing $p_{1}$ or $p_{2}$ leaves $q^{\prime}$ connected, then removing $p$ also leaves $q^{\prime}$ connected, i.e., $p$ should have been chosen, a contradiction. $\blacksquare$

According to Theorem 4.9, if we set $maxL=2$ , then $DP_{maxL}(q)$ is compact and we have two cases with respect to the number of edges in $q$ as follows.

• If $|E_{q}|$ is even then $DP_{maxL}(q)$ contains $\lfloor|E_{q}|/2\rfloor$ paths of size 2 (i.e. $|DP_{maxL}(q)|=\lfloor|E_{q}|/2\rfloor$ ).

• If $|E_{q}|$ is odd then $DP_{maxL}(q)$ contains $\lfloor|E_{q}|/2\rfloor$ paths of size 2 and one path of size 1 (i.e. $|DP_{maxL}(q)|=\lfloor|E_{q}|/2\rfloor+1$ ).
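The two cases can be merged into a single expression. The following restates them with a ceiling and introduces no assumption beyond $maxL=2$ and the compactness guaranteed by Theorem 4.9.

```latex
\[
  |DP_{2}(q)| \;=\; \left\lceil \frac{|E_{q}|}{2} \right\rceil ,
  \qquad\text{e.g.}\qquad
  \left\lceil \tfrac{7}{2} \right\rceil = 4 ,
  \quad
  \left\lceil \tfrac{21}{2} \right\rceil = 11 .
\]
```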

Example 8

Consider the query $q$ in Figure LABEL:fig:database and set $maxL=2$ . Since $|E_{q}|=7$ is odd, $|DP_{2}(q)|=\lfloor 7/2\rfloor+1=4$ . The following disjoint paths are generated using the algorithm in Figure 8: $DP_{2}(q)=\{\{3,1,2\},\{1,4,2\},\{5,2,3\},\{4,5\}\}$ . The size of $DP_{2}(q)$ is optimal. $\blacksquare$

Unfortunately, there is a tradeoff between the number of calls (depth of the search space) in Fast-P and the $maxL$ used. For instance, suppose the query $q$ is a complete graph with $|V_{q}|=7$ ; then $q$ has $|E_{q}|=|V_{q}|\cdot(|V_{q}|-1)/2=21$ edges. Choosing $maxL=1$ , Algorithm Cover produces $|DP_{1}(q)|=21$ disjoint paths, i.e., one per edge of $q$ . Setting $maxL=2$ , we still have 11 disjoint paths that cover $q$ . Compared with $|V_{q}|=7$ , matching paths of size 2 is not effective in this case.

To guarantee a higher efficiency than that of vertex-at-a-time approaches, $maxL$ must be chosen according to the following inequality.

$|E_{q}|/|V_{q}|<maxL$ (1)

Equation 1 can be stated in terms of graph density, where the density of query $q$ is defined as $d_{q}=2\cdot|E_{q}|/(|V_{q}|\cdot(|V_{q}|-1))$ . Equation 1 then becomes:

$d_{q}<2\cdot maxL/(|V_{q}|-1)$ (2)

This equation shows the role that query density plays in the performance of Fast-P: dense queries require a higher $maxL$ . Fortunately, real data and queries are typically sparse graphs.
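The bound on $maxL$ can be checked numerically. The sketch below computes the smallest integer $maxL$ that satisfies the condition $|E_{q}|/|V_{q}|<maxL$ and verifies the equivalent density form $d_{q}<2\cdot maxL/(|V_{q}|-1)$ ; the function names are hypothetical, and exact fractions are used to avoid rounding at the boundary.

```python
from fractions import Fraction


def min_max_l(num_vertices, num_edges):
    """Smallest integer maxL with |E_q| / |V_q| < maxL."""
    return num_edges // num_vertices + 1


def density(num_vertices, num_edges):
    """d_q = 2|E_q| / (|V_q| (|V_q| - 1))."""
    return Fraction(2 * num_edges, num_vertices * (num_vertices - 1))


def satisfies_density_form(num_vertices, num_edges, max_l):
    """Density form of the same condition: d_q < 2 maxL / (|V_q| - 1)."""
    return density(num_vertices, num_edges) < Fraction(2 * max_l, num_vertices - 1)


k7 = min_max_l(7, 21)        # complete graph on 7 vertices
triangle = min_max_l(3, 3)   # |V_q| = 3, |E_q| = 3
```

For the complete graph on 7 vertices the ratio $|E_{q}|/|V_{q}|$ equals 3, so $maxL$ must be at least 4, which is consistent with the observation that $maxL=2$ is not effective for that query.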

Example 9

Consider the query $q$ in Figure 2. Since $|V_{q}|=3$ and $|E_{q}|=3$ , we have $|E_{q}|/|V_{q}|=1<2$ , so setting $maxL=2$ satisfies Equation 1 and will make Fast-P faster than Fast-ON. $\blacksquare$

4.2.4 Opt2: Minimizing $|Cand(p_{i})|$

For each query path $p$ , $|Cand(p)|$ is guaranteed to be no larger than $\prod_{v_{i}\in p}|Cand(v_{i})|$ , because vertex connections are already accounted for in the paths. For instance, consider a query path $p=(v_{1},v_{2},v_{3})$ with given $Cand(v_{1})$ , $Cand(v_{2})$ , and $Cand(v_{3})$ . A vertex-at-a-time algorithm has to consider $|Cand(v_{1})|\times|Cand(v_{2})|\times|Cand(v_{3})|$ combinations. In contrast, $|Cand(p)|$ is usually much smaller than this product, since only the paths that actually connect vertices in $Cand(v_{1})$ , $Cand(v_{2})$ , and $Cand(v_{3})$ are considered. Hereafter, we optimize $Cand(p)$ further, i.e., reduce the candidate set of each path $p\in DP_{maxL}(q)$ , by utilizing the neighborhood labels of all vertices in $p$ .

The next theorem presents the necessary condition that any data graph path $p^{\prime}\in Cand(p)$ must satisfy to participate in a subgraph isomorphism between $q$ and the data graph $G$ .

Theorem 4.10.

If the query graph $q$ is subgraph isomorphic to the data graph $G$ , then for any $p^{\prime}\in Cand(p)$ participating in the isomorphism, $p^{c}=(u_{1},\ldots,u_{k})$ and $p^{\prime^{c}}=(v_{1},\ldots,v_{k})$ must satisfy

1. $NL_{q}(u_{i})\subseteq NL_{G}(v_{i})$ $\forall$ $i=1,\ldots,k$ , or

2. $NL_{q}(u_{i})\subseteq NL_{G}(v_{k-i+1})$ $\forall$ $i=1,\ldots,k$ . $\blacksquare$

The previous theorem presents the necessary condition required for a data graph path $p^{\prime}$ to be included in $Cand(p)$ , $p\in DP_{maxL}(q)$ . Applying this condition while constructing $Cand(p)$ would minimize $Cand(p)$ and cut down the search space of Fast-P.

Corollary 4.11.

In the case of non-iso path $p$ , the first test is sufficient. $\blacksquare$
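The two inclusion tests of Theorem 4.10, together with the shortcut of Corollary 4.11, can be sketched as below. The sketch assumes that the neighborhood labels $NL(\cdot)$ of each vertex are available as plain Python sets, listed in the order of the canonical path; the function name and the example label sets are hypothetical.

```python
def passes_nl_test(nl_query, nl_data, is_iso):
    """nl_query[i], nl_data[i]: neighborhood label sets of the i-th canonical-path vertex."""
    k = len(nl_query)
    if k != len(nl_data):
        return False
    # Test 1: position i of the query path against position i of the data path.
    forward = all(nl_query[i] <= nl_data[i] for i in range(k))
    if forward or not is_iso:
        # For a non-iso path only the first test applies (Corollary 4.11).
        return forward
    # Test 2: position i against the mirrored position k-i+1 (0-based: k-1-i).
    return all(nl_query[i] <= nl_data[k - 1 - i] for i in range(k))


# Hypothetical label sets for a 3-vertex path.
nl_q = [{"X"}, {"X", "Y"}, {"Y"}]
nl_g = [{"Y", "Z"}, {"X", "Y"}, {"X"}]
iso_result = passes_nl_test(nl_q, nl_g, is_iso=True)
non_iso_result = passes_nl_test(nl_q, nl_g, is_iso=False)
```

In this example the forward test fails at the first position, but the mirrored test succeeds, so the data path survives only if the query path is iso.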

To efficiently apply the inclusion tests in the Fast-P algorithm, we construct a bit matrix similar to the one used with Fast-ON, $M_{DLN}=(m_{ij})_{\alpha\beta}$ (where $\alpha=|DLN_{q}|$ and $\beta=|DLN_{G}|$ ), together with the same two pointers $P_{q}$ and $P_{G}$ as in the Fast-ON algorithm. The two tests in Theorem 4.10 are replaced by the following two tests:

1. $m_{{P_{q}(u_{i})}{P_{G}(v_{i})}}=1$ $\forall$ $i=1,\ldots,k$ , or

2. $m_{{P_{q}(u_{i})}{P_{G}(v_{k-i+1})}}=1$ $\forall$ $i=1,\ldots,k$ .
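One way to read the bit matrix is sketched below. It assumes that $DLN_{q}$ and $DLN_{G}$ are the lists of distinct neighborhood label sets of the query and the data graph, that $P_{q}$ and $P_{G}$ map each vertex to the index of its label set, and that $m_{ij}=1$ exactly when the $i$ -th query set is contained in the $j$ -th data set. These readings, the helper names, and the example vertices are assumptions made for illustration.

```python
def build_dln(neighborhood_labels):
    """Deduplicate label sets; return (distinct sets, vertex -> index pointer)."""
    distinct, pointer, index_of = [], {}, {}
    for vertex, labels in neighborhood_labels.items():
        key = frozenset(labels)
        if key not in index_of:
            index_of[key] = len(distinct)
            distinct.append(key)
        pointer[vertex] = index_of[key]
    return distinct, pointer


def build_matrix(dln_q, dln_g):
    """m[i][j] = 1 iff the i-th query label set is a subset of the j-th data label set."""
    return [[1 if a <= b else 0 for b in dln_g] for a in dln_q]


# Hypothetical neighborhood labels.
nl_q = {"u1": {"A"}, "u2": {"A", "B"}, "u3": {"A"}}
nl_g = {"v1": {"A", "C"}, "v2": {"A", "B"}, "v3": {"B"}}

dln_q, p_q = build_dln(nl_q)
dln_g, p_g = build_dln(nl_g)
m = build_matrix(dln_q, dln_g)

# The inclusion test for (u3, v1) becomes a single table lookup.
u3_v1 = m[p_q["u3"]][p_g["v1"]]
```

Each subset test is computed once per pair of distinct label sets, so vertices that share a neighborhood label set also share a matrix row or column.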

4.2.5 Opt3: Ordering $DP_{maxL}(q)$

Although $Cand(p_{i})$ is minimized for each $p_{i}\in DP_{maxL}(q)$ by Opt2, the search order of the paths in $DP_{maxL}(q)$ is arbitrary, which can seriously slow down the algorithm. The query disjoint paths $DP_{maxL}(q)$ should be explored in an order that excludes false local matches of each path $p_{i}\in DP_{maxL}(q)$ as early as possible, saving much of the time that would otherwise be spent on long false partial mappings. A local match of path $p_{i}$ is false if it does not preserve structural connectivity. By maximizing the node overlap of the query disjoint path $p_{i}\in DP_{maxL}(q)$ currently being processed with the previously explored ones ( $p_{1},...,p_{i-1}$ ), we in fact maximize the connectivity between $p_{i}$ and the previously explored paths, and thus increase the likelihood that false local matches are detected early. We therefore adopt an ordering $DP_{maxL}(q)=\{p_{1},p_{2},\ldots,p_{|DP_{maxL}(q)|}\}$ such that the node overlap of $V_{p_{i}}$ with $\cup_{j<i}V_{p_{j}}$ is maximized. The first path $p_{1}$ is chosen such that $\sum_{u\in V_{p_{1}}}freq(u)$ is maximum, where $freq(u)$ is the frequency of the node $u$ with respect to $DP_{maxL}(q)$ . Figure 9 outlines the idea.
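The ordering can be sketched as a greedy procedure. The sketch assumes that $freq(u)$ counts the disjoint paths containing $u$ and that ties are broken by input order; neither detail is specified in the text above, and the function name is hypothetical.

```python
from collections import Counter


def order_paths(disjoint_paths):
    """Order paths so that each one overlaps the already ordered paths as much as possible."""
    freq = Counter(u for p in disjoint_paths for u in p)
    remaining = list(disjoint_paths)

    # First path: maximum sum of vertex frequencies (first maximum wins on ties).
    first = max(remaining, key=lambda p: sum(freq[u] for u in p))
    remaining.remove(first)
    ordered, covered = [first], set(first)

    # Next paths: maximum node overlap with the vertices covered so far.
    while remaining:
        nxt = max(remaining, key=lambda p: len(covered & set(p)))
        remaining.remove(nxt)
        ordered.append(nxt)
        covered |= set(nxt)
    return ordered


# The cover of Example 7, given in a deliberately unhelpful order.
ordered = order_paths([(1, 3), (3, 4, 5), (5, 2, 4), (2, 1, 4)])
```

With this input, the 1-edge path is moved to the end because it shares the fewest vertices with the paths chosen before it.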

4.2.6 Fast-P Pseudocode

The pseudocode of Fast-P is similar to that of the Fast-ON algorithm, except that paths are used instead of vertices. Figures 10 and 11 outline the pseudocode of the Fast-P algorithm. The main difference between the Fast-ON and Fast-P codes is that a query vertex has only one image at a time in Fast-ON, whereas it could have more than one image in Fast-P. This is because a query vertex may appear in several query disjoint paths, and thus may have different images in the different candidate paths of the data graph. To overcome this in Fast-P, we combine candidate paths only if they agree on the image of every shared vertex. To implement this, two counters are used in Fast-P: one for each vertex $u\in V_{q}$ , denoted by $u.Count$ , and the other for each vertex $v\in V_{G}$ , denoted by $v.Count^{\prime}$ . If $u\in V_{q}$ is mapped to vertex $v\in V_{G}$ , denoted as $h[u]=v$ , then we increment $u.Count$ and $v.Count^{\prime}$ by one (Lines 5-7 in function Matchable [Figure 11]), and in the backtracking step we decrement $u.Count$ and $v.Count^{\prime}$ by one (Lines 11-13 in the Recursive_Search( $p_{i}$ ) algorithm [Figure 11]).

Regarding Figure 10, Lines 1-5 initialize the counter of each query vertex and each graph vertex, and initialize the mapping of each vertex $u\in V_{q}$ ( $h[u]=0$ ). Lines 6-7 enumerate all simple paths of size up to $maxL$ in $q$ and $G$ , respectively. Line 8 applies the first optimization (Opt1), whereas Line 9 applies the second optimization (Opt2). Lines 10-16 apply the third optimization (Opt3). Line 17 initializes the mapping $f$ , which maps each path in $DP^{*}_{maxL}(q)$ to NULL.

The procedure $Recursive\_Search$ (Figure 11) matches a previously unmatched $p_{i}\in DP^{*}_{maxL}(q)$ over $Cand(p_{i})$ , and proceeds step by step by recursively matching the subsequent path $p_{i+1}$ over $Cand(p_{i+1})$ (lines 6-7), or sets Test to true (line 8) and returns if every path $p_{i}\in DP^{*}_{maxL}(q)$ has a counterpart in ${\cal P}_{G}$ (line 9). If $p_{i}$ exhausts all paths in $Cand(p_{i})$ without finding a match, $Recursive\_Search$ backtracks to the previous state for further exploration (lines 10-13). In function $Matchable$ (Figure 11), $p_{i}\in DP^{*}_{maxL}(q)$ is not mapped to $p^{\prime}\in Cand(p_{i})$ if, for some $j$ with $1\leq j\leq|V_{p_{i}}|$ , the mapping $h[u]=v$ (where $u$ is the $j$ -th vertex in ${p_{i}}^{c}$ and $v$ is the $j$ -th vertex in ${p^{\prime}}^{c}$ ) is not satisfied. In this case Matchable returns FALSE; otherwise it returns TRUE.
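The interplay of the mapping $h$ , the two counters, and backtracking can be sketched as follows. This is a simplified illustration, not a transcription of Figures 10 and 11: candidate paths are assumed to be given already oriented (an iso path would be listed in both directions), Python dictionaries stand in for $u.Count$ and $v.Count^{\prime}$ , and all function names and the example vertices are hypothetical.

```python
def matchable(q_path, g_path, h, u_count, v_count):
    """Extend h with q_path -> g_path if this is consistent and injective."""
    if len(q_path) != len(g_path):
        return False
    for u, v in zip(q_path, g_path):
        if u in h:
            if h[u] != v:               # u already has a different image
                return False
        elif v_count.get(v, 0) > 0:     # v is already the image of another query vertex
            return False
    for u, v in zip(q_path, g_path):
        h[u] = v
        u_count[u] = u_count.get(u, 0) + 1
        v_count[v] = v_count.get(v, 0) + 1
    return True


def unmatch(q_path, h, u_count, v_count):
    """Undo one path match; drop a vertex image once no matched path uses it."""
    for u in q_path:
        v = h[u]
        u_count[u] -= 1
        v_count[v] -= 1
        if u_count[u] == 0:
            del h[u]


def recursive_search(i, query_paths, cand, h, u_count, v_count):
    """Match query_paths[i:] one path at a time, backtracking on failure."""
    if i == len(query_paths):
        return True
    for g_path in cand[i]:
        if matchable(query_paths[i], g_path, h, u_count, v_count):
            if recursive_search(i + 1, query_paths, cand, h, u_count, v_count):
                return True
            unmatch(query_paths[i], h, u_count, v_count)
    return False


# Triangle query covered by two disjoint paths; hypothetical candidate lists.
query_paths = [(1, 2, 3), (1, 3)]
cand = [[("b", "c", "d"), ("a", "b", "c")], [("c", "a"), ("a", "c")]]
h, u_count, v_count = {}, {}, {}
found = recursive_search(0, query_paths, cand, h, u_count, v_count)
```

In this run the first candidate for the 2-edge path is rejected after no candidate of the 1-edge path agrees on the images of vertices 1 and 3, and the search backtracks to the second candidate. Since the disjoint paths cover every query edge, a consistent and injective vertex mapping over all paths yields a subgraph isomorphism.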

5 Experimental Evaluation

The experimental evaluation of the two algorithms, Fast-ON and Fast-P, was carried out on a PC with an Intel 3 GHz dual-core CPU and 4 GB of main memory, running Linux. The algorithms were implemented in standard C++ with STL library support and compiled with GNU GCC. To make the time measurements more reliable, no other applications were running on the machine during the experiments. In the experiments, we consider vertex/edge labeled graphs and vertex labeled graphs.

The rest of this section is organized as follows. In Section 5.1, we present the datasets used in our evaluation. The effects of the optimization methods are presented in Section 5.2.1. Finally, in the remaining sections, we present experimental results of the two algorithms (Fast-ON and Fast-P).

5.1 Datasets

5.1.1 Real Dataset

AIDS_10K. The first real dataset, referred to as AIDS_10K, consists of 10,000 graphs that are randomly drawn from the AIDS Antiviral screen database (http://dtp.nci.gov/). These graphs have 25 vertices and 27 edges on average. There are 62 distinct vertex labels in total in the dataset, but the majority of the labels are C, O and N. The total number of distinct edge labels is 3.

Chem_1M. In order to study the scalability of Fast-ON and Fast-P with respect to dataset size, we use a large real chemical compound dataset, referred to as Chem_1M. Chem_1M is a subset of the PubChem database (ftp://ftp.ncbi.nlm.nih.gov/pubchem/) and consists of one million graphs. Its graphs have 23.98 vertices and 25.76 edges on average. The numbers of distinct vertex and edge labels are 81 and 3, respectively. For this study, we derive subsets of Chem_1M; each subset consists of N graphs and is called the Chem_N dataset. Note that Chem_1M is the same as that used in [Han et al. (2010)].

5.1.2 Synthetic Datasets

The synthetic datasets are generated using the synthetic graph data generator GraphGen [Cheng et al. (2007)]. The generator allows us to specify various parameters such as the average graph density D, the graph size E, and the number of distinct vertex/edge labels L. For example, Syn10K.E30.D5.L50 contains 10,000 graphs; the average size of each graph is 30; the density of each graph is 0.5; and the number of distinct vertex/edge labels is 50. Five synthetic datasets with varying parameter values are used in the experiments in order to observe how performance changes with these parameters (Syn10K.E30.D3.L50, Syn10K.E30.D5.L50, Syn10K.E30.D7.L50, Syn10K.E30.D5.L80 and Syn10K.E30.D5.L20). Note that all five of these synthetic datasets are dense and are the same as in [Han et al. (2010)]. We also obtained another synthetic dataset from CT-index [Klein et al. (2011)]. This dataset is sparse, and we denote it by SynCT_10K.
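The naming convention of the synthetic datasets can be decoded mechanically. The sketch below assumes, based on the single worked example above, that the number after D is the density in tenths; the function name and the returned keys are hypothetical.

```python
import re


def parse_dataset_name(name):
    """Decode names of the form Syn<N>K.E<size>.D<density*10>.L<labels>."""
    match = re.fullmatch(r"Syn(\d+)K\.E(\d+)\.D(\d+)\.L(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name}")
    thousands, size, density_tenths, labels = map(int, match.groups())
    return {
        "graphs": thousands * 1000,
        "avg_size": size,
        "density": density_tenths / 10,  # assumption: D5 means density 0.5
        "labels": labels,
    }


params = parse_dataset_name("Syn10K.E30.D5.L50")
```

Names that do not follow this pattern, such as SynCT_10K, are rejected rather than guessed at.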

5.1.3 Query Sets

For each dataset (real or synthetic), there are six query sets Q4, Q8, Q12, Q16, Q20 and Q24. Each Qi consists of 1000 queries, each of size $i$ . For AIDS_10K, Chem_1M, and the five dense synthetic datasets, we adopt the query sets from [Han et al. (2010)]. For SynCT_10K, we adopt the query set from [Klein et al. (2011)].

5.2 Performance of Subgraph Checking Algorithms

5.2.1 Effects of Optimizations

In this section, we show the effect of each optimization on the performance of Fast-ON and Fast-P algorithms.

• Effects of Optimizations in Fast-ON Algorithm

There are two optimizations, called Opt1 and Opt2, introduced in Fast-ON. In this experiment, we show the effect of each optimization independently, and the effect of both combined, on the performance of Fast-ON. For this purpose, we implemented three versions of Fast-ON, namely, Fast-O, which uses only the first optimization Opt1; Fast-N, which uses only the second optimization Opt2; and Fast-ON, which uses both optimizations.

Figure 12 plots the results obtained by running the three versions on AIDS_10K for the different query sets. The figure shows that Fast-N is faster than Fast-O except for Q12 and Q16, where Fast-O is the faster of the two. In addition to its influence on speed, the first optimization makes the algorithm less sensitive to query size. Fast-ON shows the best performance; it outperforms both Fast-O and Fast-N. This result confirms that the two optimizations are neither independent nor conflicting, but complementary to each other.

• Effects of Optimizations in Fast-P Algorithm

In the Fast-P algorithm, there are three optimizations, called Opt1, Opt2, and Opt3. In this experiment, we show the effect of each optimization independently, and the effect of all three combined, on the performance of Fast-P.

To show the effect of the first optimization (Opt1), we implemented two versions, namely, Fast-P(1-Edge), which sets $maxL=1$ , and Fast-P(2-Edge), which sets $maxL=2$ . We also use the Fast-ON algorithm and denote it here by Fast-P(Vertex), since Fast-ON applies vertex-at-a-time matching rather than path-at-a-time matching. Figure 13(a) plots the results obtained by running Fast-P(2-Edge), Fast-P(1-Edge), and Fast-P(Vertex) on AIDS_10K for the different query sets. The figure shows that Fast-P(1-Edge) is faster than Fast-P(Vertex) except for Q4, where Fast-P(Vertex) is the faster of the two. Fast-P(2-Edge) shows the best performance; it outperforms both Fast-P(Vertex) and Fast-P(1-Edge). This result is expected since Fast-P(2-Edge) uses larger local matches. Note that Fast-P(Vertex), Fast-P(1-Edge), and Fast-P(2-Edge) all apply the remaining two optimizations (Opt2 and Opt3). In the following experiments, we denote Fast-P(2-Edge) by Fast-P.

To show the effect of the remaining two optimizations (Opt2 and Opt3), we implemented two versions, namely, Fast-P(N), which uses only the second optimization (Opt2), and Fast-P(O), which uses only the third optimization (Opt3). Note that we set $maxL=2$ for the two versions (i.e., both versions apply the first optimization). Figure 13(b) plots the results obtained by running the two versions on AIDS_10K for the different query sets. The figure shows that Fast-P(O) is faster than Fast-P(N) except for Q4 and Q8, where Fast-P(N) is the faster of the two. Note that the third optimization (Opt3) makes the algorithm (Fast-P(O)) less sensitive to query size. Fast-P shows the best performance; it outperforms both Fast-P(N) and Fast-P(O).

The previous results confirm that the three optimizations in Fast-P are neither independent nor conflicting, but complementary to each other.

5.2.2 Fast-ON vs. Fast-P

In this section, we demonstrate the efficiency of our two subgraph isomorphism algorithms Fast-ON and Fast-P on sparse datasets (the graphs have small density) and on dense datasets (the graphs have high density) as follows.

• Performance on Sparse Datasets

In this experiment, we test the performance of Fast-ON and Fast-P on the sparse datasets AIDS_10K, Chem_200K, and SynCT_10K. Figure 14 reports the results on these datasets. The figure shows that the Fast-P algorithm always has a lower response time than the Fast-ON algorithm, by a factor of up to 2. In the following experiments, we will use Fast-P for sparse datasets.

• Performance on Dense Datasets

In this experiment, we test the performance of Fast-ON, Fast-P(1-Edge), and Fast-P on the five dense datasets Syn10K.E30.D3.L50, Syn10K.E30.D5.L50, Syn10K.E30.D7.L50, Syn10K.E30.D5.L80 and Syn10K.E30.D5.L20. Figure 15 reports the results on the five datasets. The figure shows that Fast-P is the slowest of the three, since both Fast-ON and Fast-P(1-Edge) significantly outperform it. Fast-ON and Fast-P(1-Edge) have roughly the same response time on the five datasets. In the following experiments, we will use Fast-ON for dense datasets. Note that the performance gain of Fast-ON over Fast-P increases dramatically as the density increases. This happens for two reasons. The first is the cost of the inclusion tests in Fast-P, since we cannot use the distinct neighborhood strategy with dense datasets. The second is the large number of paths compatible with each query path.

In the next experiments, the two algorithms Fast-ON and Fast-P are tested against the state-of-the-art subgraph isomorphism algorithms like Ullman (we implemented it using standard C++ with STL library support), QuickSI (we obtained its executable from the authors) and Vflib (we downloaded it from http://amalfi.dis.unina.it/graph/db/vflib-2.0).

5.2.3 Fast-P vs. Ullman, Vflib, and QuickSI on Sparse Datasets

In this experiment, we demonstrate the efficiency of our subgraph isomorphism testing algorithm Fast-P against the Ullman and Vflib algorithms on labeled sparse datasets, and against the Ullman, Vflib, and QuickSI algorithms on unlabeled sparse datasets (QuickSI works only on datasets with unlabeled edges), as follows.

• On Labeled Sparse Datasets

Here, we evaluate the performance of Fast-P on the AIDS_10K, Chem_10K, and SynCT_10K datasets by comparing it with the two algorithms Ullman and Vflib. The total response time for each query set of the three datasets is recorded in Figure 17. For the two datasets Chem_10K and SynCT_10K, Ullman is faster than Vflib, while Vflib outperforms Ullman on AIDS_10K except for Q4. Fast-P shows the best performance; it outperforms both Ullman and Vflib on the three datasets by a wide margin.

• On Unlabeled Sparse Datasets

Here, we used the two sparse datasets AIDS_10K and Chem_10K after removing the edge labels, and we denote them as Unlabeled AIDS_10K and Unlabeled Chem_10K. Figure 16 reports the results on the two datasets. The figure shows that QuickSI outperforms Ullman and Vflib on the two datasets. Fast-P shows the best performance; it outperforms Ullman, Vflib, and QuickSI on the AIDS_10K dataset by more than two orders of magnitude, more than one order of magnitude, and a factor of three, respectively (note that Ullman is not shown for the query sets Q16, Q20, and Q24 since it failed to run on our machine). On the Chem_10K dataset, Fast-P outperforms Ullman, Vflib, and QuickSI by one order of magnitude, more than two orders of magnitude, and a factor of 4, respectively.

5.2.4 Fast-ON vs. Ullman, Vflib, and QuickSI on Dense Datasets

In this experiment, we demonstrate the efficiency of our subgraph isomorphism testing algorithm Fast-ON against the Ullman and Vflib algorithms on labeled dense datasets, and against the Ullman, Vflib, and QuickSI algorithms on unlabeled dense datasets (QuickSI works only on datasets with unlabeled edges), as follows.

• On Labeled Dense Datasets

In this subsection, we evaluate the performance of Fast-ON on the five dense datasets Syn10K.E30.D3.L50, Syn10K.E30.D5.L50, Syn10K.E30.D7.L50, Syn10K.E30.D5.L80 and Syn10K.E30.D5.L20 by comparing it with the two algorithms Ullman and Vflib. The total response time for each query set of the five datasets is shown in Figure 18. The figure shows that Ullman is faster than Vflib by a large margin and that Fast-ON shows the best performance; it outperforms Ullman and Vflib on the five labeled dense datasets by a factor of up to 3 and by more than two orders of magnitude, respectively.

• On Unlabeled Dense Datasets

Here, we used the three dense datasets Syn10K.E30.D3.L50, Syn10K.E30.D5.L50, and Syn10K.E30.D5.L20 after removing the edge labels, and we denote them as Unlabeled Syn10K.E30.D3.L50, Unlabeled Syn10K.E30.D5.L50 and Unlabeled Syn10K.E30.D5.L20. The total response time for each query set of the three datasets is shown in Figure 19. The figure shows that Ullman outperforms Vflib and QuickSI on the three datasets, that Vflib is the slowest, and that Fast-ON shows the best performance; it outperforms Ullman, Vflib, and QuickSI on the three unlabeled dense datasets by a factor of up to 3, more than two orders of magnitude, and more than one order of magnitude, respectively.

5.2.5 Scalability

In this experiment, we show the scalability of Ullman, Vflib, Fast-ON, and Fast-P on labeled sparse datasets and the scalability of Ullman, Vflib, QuickSI, Fast-ON, and Fast-P on unlabeled sparse datasets as follows.

• On Labeled Sparse Datasets

Figure 20 shows the scalability of Ullman, Vflib, Fast-ON, and Fast-P with respect to the number of graphs, using the labeled sparse dataset Chem_1M and the labeled query set $Q8$ . The figure shows that the four algorithms scale linearly. However, Fast-ON outperforms Ullman by a factor of three, and Vflib by more than one order of magnitude. Moreover, Vflib is the slowest, and it is not shown for 1000K graphs since it failed to run on large datasets. The figure also shows that Fast-P has the best performance; it outperforms Ullman, Vflib, and Fast-ON by up to one order of magnitude, more than two orders of magnitude, and a factor of up to two, respectively.

• On Unlabeled Sparse Datasets

In this subsection, we used the Chem_1M dataset and the query set Q8 after removing the edge labels, and we denote them as Unlabeled Chem_1M and Unlabeled Q8. Figure 21 shows the scalability of Ullman, Vflib, QuickSI, Fast-ON, and Fast-P with respect to the number of graphs, using the sparse dataset Unlabeled Chem_1M and the query set Unlabeled $Q8$ . The figure shows that the five algorithms scale linearly. However, Fast-ON outperforms Ullman by a factor of five, Vflib by more than one order of magnitude, and QuickSI by a factor of up to two. Moreover, Vflib is the slowest. Note that Vflib and QuickSI are not shown for 1000K graphs, since they failed to run on large datasets. The figure also shows that Fast-P has the best performance; it outperforms Ullman, Vflib, QuickSI, and Fast-ON by up to one order of magnitude, up to two orders of magnitude, a factor of up to four, and a factor of up to two, respectively.

6 Conclusion

This paper presented two improvements to the Ullmann algorithm, a well-known subgraph isomorphism checker, named Fast-ON and Fast-P. Fast-ON improves Ullman by reducing its search space using, first, a refined vertex matching process and, second, a new search ordering methodology. Fast-P, on the other hand, applies path-at-a-time matching: it leverages structure instead of single-vertex matching and uses an efficient path ordering methodology to reduce the search space. Experiments show that significant improvements, up to four orders of magnitude, are achieved.

Bibliography

Cai et al. (2005) D. Cai, Z. Shao, X. He, X. Yan, and J. Han. 2005. Community mining from multi-relational networks. Proc. of PKDD (2005).
Cheng et al. (2007) J. Cheng, Y. Ke, W. Ng, and A. Lu. 2007. Fg-index: towards verification-free query processing on graph databases. SIGMOD (2007), 857–872.
Cordella et al. (2004) L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. 2004. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10) (2004), 1367–1372.
Garey and Johnson (1990) M. R. Garey and D. S. Johnson. 1990. Computers and Intractability; Guide to the theory of NP-Completeness. W. H. Freeman & Co. (1990).
Gouda and Hassaan (2012) K. Gouda and M. Hassaan. 2012. A Fast Algorithm for Subgraph Search Problem. INFOS (2012), 53–59.
Han et al. (2010) W.-S. Han, J. Lee, M.-D. Pham, and J. X. Yu. 2010. igraph: a framework for comparisons of disk-based graph indexing techniques. PVLDB (2010), 449–459.
Klein et al. (2011) Karsten Klein, Nils Kriege, and Petra Mutzel. 2011. CT-Index: Fingerprint-based Graph Indexing Combining Cycles and Trees. ICDE (2011), 1115–1126.
