RadegastXDB - Prototype of Native XML Database Management System: Technical Report

Petr Lukáš; Radim Bača; Michal Krátký

arXiv:1903.03761·cs.DB·March 15, 2019

TL;DR

This paper presents RadegastXDB, a native XML database system that incorporates twig pattern query detection to improve the efficiency of processing structural XQueries, outperforming existing XML DBMSs especially on large datasets.

Contribution

Introduction of RadegastXDB, a prototype XML DBMS that integrates twig pattern query detection and advanced algorithms to enhance query processing performance.

Findings

1. RadegastXDB outperforms current XML DBMSs on structural queries.

2. State-of-the-art TPQ algorithms improve query speed on large datasets.

3. Efficient processing of queries with value predicates using the proposed techniques.

Abstract

Many advances in the processing of XML data have been proposed in the last two decades. Many approaches have focused on the efficient processing of twig pattern queries (TPQ). However, integrating TPQ processing into an XQuery compiler is not a straightforward task, and current XML DBMSs process XQueries without any TPQ detection. In this paper, we demonstrate our prototype of a native XML DBMS called RadegastXDB, which uses TPQ detection to accelerate structural XQueries. Such detection allows us to utilize state-of-the-art TPQ processing algorithms. Our experiments show that, for structural queries, these algorithms and state-of-the-art XML indexing techniques make our prototype faster than all of the current XML DBMSs, especially for large data collections. We also show that the same techniques are efficient for processing queries with value predicates.

Tables

Table 1: DBMSs included in experiments

Oracle Berkeley DB 6.1.4 (B-DB) www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
Virtuoso 7.1 (VRT) virtuoso.openlinksw.com
eXist-db 4.3.1 (E-DB) exist-db.org
BaseX 9.0.2 (BX) basex.org
MonetDB XQuery 4 (M-DB) www.monetdb.org/XQuery
Commercial XML DBMS (CX)
Commercial relational DBMS 1 and 2 (CR1, CR2)

Table 2: Statistics of data collections

Collection	Size (MB)	XML nodes	Max. depth
XMark (f=1)	111	2,048,193	14
XMark (f=10)	1,137	20,532,805	14
SwissProt	109	5,166,890	7
TreeBank	82	2,437,667	38
DBLP	127	3,736,406	6

	Structural queries									Queries with value predicates
	GTP	CTJ	FP-BJ	B-DB	VRT	BX	M-DB	CR1	CR2	GTP	FP-BJ	B-DB	VRT	BX	M-DB	CR1	CR2
XM1	0.010	0.011	0.002	0.265	4.422	0.112	0.030	63.655	DNF	0.004	0.008	0.474	4.213	0.347	0.145	93.868	DNF
XM2	0.041	0.041	0.016	0.870	4.979	0.358	0.094	DNF	DNF	0.023	0.023	0.193	4.156	0.981	0.111	9.710	DNF
XM3	0.007	0.003	0.010	0.078	4.318	0.013	0.036	0.054	0.031	0.000	0.000	0.021	4.094	0.004	0.139	2.140	DNF
XM4	0.041	0.041	0.016	0.740	4.922	0.137	0.062	DNF	DNF	0.010	0.010	0.084	3.984	2.508	0.134	3.510	DNF
XM5	0.034	0.033	0.008	0.563	4.630	0.223	0.045	223.361	DNF	0.006	0.002	0.333	3.990	0.168	0.256	DNF	DNF
XM1	0.108	0.105	0.025	2.427	58.573	0.972	0.106	DNF	DNF	0.031	0.068	9.120	50.057	3.479	0.665	DNF	DNF
XM2	0.423	0.423	0.141	8.450	66.136	3.079	0.621	DNF	DNF	0.236	0.211	1.677	51.078	3.377	0.548	DNF	DNF
XM3	0.071	0.032	0.078	0.250	59.500	0.087	0.063	DNF	0.250	0.006	0.000	0.021	50.141	0.005	0.528	16.494	DNF
XM4	0.436	0.426	0.141	8.146	71.620	1.328	0.355	DNF	DNF	0.121	0.117	0.370	51.349	9.451	0.516	253.375	DNF
XM5	0.338	0.337	0.074	5.468	66.354	1.857	0.195	DNF	DNF	0.041	0.041	5.292	51.588	1.547	1.073	DNF	DNF
SP1	0.019	0.006	0.016	0.042	9.828	1.030	0.095	DNF	10.083	0.000	0.000	0.021	9.828	0.002	0.166	6.197	DNF
SP2	0.210	0.199	0.100	3.943	15.156	1.708	0.170	DNF	DNF	0.002	0.002	3.500	10.224	1.103	0.254	0.434	DNF
SP3	0.042	0.042	0.020	1.656	10.375	0.675	0.136	DNF	DNF	0.063	0.055	4.933	10.135	1.280	DNF	DNF	DNF
SP4	0.021	0.012	0.016	0.062	9.719	0.574	0.052	19.716	0.297	0.000	0.000	1.026	9.776	1.521	0.573	1.303	DNF
SP5	0.338	0.306	0.156	9.073	16.797	2.821	0.150	DNF	DNF	0.041	0.020	0.313	9.411	0.276	0.963	54.766	DNF
TB1	0.012	0.002	0.012	0.469	6.073	DNF	0.048	DNF	130.187
TB2	0.016	0.014	0.016	78.651	6.375	DNF	0.091	DNF	DNF
TB3	0.012	0.012	0.016	0.766	5.255	DNF	0.059	DNF	DNF
TB4	0.068	0.068	0.018	0.740	5.359	DNF	0.058	DNF	0.109
TB5	0.099	0.100	0.043	6.136	6.672	DNF	0.088	DNF	DNF
DB1	0.066	0.064	0.031	3.693	10.089	1.435	0.084	DNF	DNF	0.012	0.006	8.338	9.214	0.002	0.154	61.869	DNF
DB2	0.011	0.011	0.010	0.047	8.672	0.511	0.035	14.162	0.141	0.000	0.000	0.922	8.870	0.028	0.208	26.061	DNF
DB3	0.022	0.021	0.016	1.875	9.031	0.881	0.043	DNF	DNF	0.016	0.008	6.000	9.011	0.804	0.269	28.120	DNF
DB4	0.188	0.184	0.059	5.031	10.036	1.741	0.227	DNF	DNF	0.010	0.012	3.521	9.078	0.005	0.208	4.603	DNF
DB5	0.040	0.039	0.014	0.411	9.015	1.081	0.061	DNF	7.630	0.016	0.018	1.531	9.302	0.112	0.234	176.474	DNF

Taxonomy

Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Data Mining Algorithms and Applications

Full text

RadegastXDB – Prototype of Native XML Database Management System: Technical Report

Petr Lukáš

Radim Bača

Michal Krátký

Department of Computer Science

Faculty of Electrical Engineering and Computer Science

VSB – Technical University of Ostrava

1 Introduction

Many advances in the processing of XML data have been proposed in the last two decades. Especially between 2000 and 2010, many approaches focused on the efficient processing of XQueries modeled by twig pattern queries (TPQ) (e.g., [19, 2, 6, 16, 7, 15, 14, 17]). In general, there are two major groups of TPQ processing algorithms: binary structural joins [2, 1, 16, 10] and holistic twig joins [6, 7, 14, 4], where the latter group is considered the state of the art. However, most current XML database management systems (DBMSs) do not utilize holistic twig joins, since these DBMSs are not capable of detecting TPQs in XQueries. Instead, they rely on rather naive techniques such as nested loops or the traditional relational merge join algorithms. In other words, they ignore most of the advances in XML query processing introduced in the last two decades and, therefore, perform poorly even on simple structural queries over large data collections.

A TPQ is a rooted labeled tree in which each node corresponds to one location step in an XQuery. A sample TPQ is illustrated in Figure 1b; it corresponds to the XQuery in Figure 1a. The single-lined and double-lined edges represent the parent-child (PC) and ancestor-descendant (AD) structural relationships corresponding to the child and descendant axes, respectively. (For the sake of simplicity, we consider only these two XPath axes, since this is common to most XML query processing approaches.) We call the nodes in a TPQ query nodes and denote them by the ‘#’ character. Additionally, the circled query nodes represent output query nodes (also called extraction points [12]), which correspond to the last location steps in the ‘for’ clauses. In a nutshell, processing a TPQ means finding all mappings from the TPQ to an XML document such that the query nodes are mapped to XML nodes of the corresponding name and these XML nodes satisfy the relationships specified by the query edges. For more details about the processing of a TPQ, we refer to [3].
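The mapping semantics described above can be sketched as a naive recursive matcher. This is only an illustration of what a TPQ match is, not RadegastXDB's algorithm and not one of the structural or holistic joins cited in this report. The `QueryNode` class, the `candidates`, `match`, and `evaluate` functions, the sample document, and the twig (roughly the XPath `//b[d]//c`) are all hypothetical, and the sketch assumes that query node names are unique within the TPQ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """One TPQ query node; `axis` is the edge to its parent query node."""
    name: str                      # element name an XML node must have
    axis: str = "AD"               # "PC" (child axis) or "AD" (descendant axis)
    children: list = field(default_factory=list)


def candidates(xml_node, axis):
    """XML nodes reachable from xml_node over a PC or an AD edge."""
    if axis == "PC":
        return list(xml_node)                                  # direct children
    return [d for d in xml_node.iter() if d is not xml_node]   # proper descendants


def match(qnode, xml_node):
    """All mappings of the query subtree rooted at qnode, with qnode -> xml_node.

    A mapping is a dict from query node name to XML element
    (assumption: names are unique within the query).
    """
    if xml_node.tag != qnode.name:
        return []
    mappings = [{qnode.name: xml_node}]
    for child in qnode.children:
        extended = []
        for cand in candidates(xml_node, child.axis):
            for sub in match(child, cand):
                extended.extend({**m, **sub} for m in mappings)
        mappings = extended        # becomes empty as soon as one branch fails
    return mappings


def evaluate(query_root, doc_root):
    """Try every XML node as the image of the root query node."""
    results = []
    for xml_node in doc_root.iter():
        results.extend(match(query_root, xml_node))
    return results


# Hypothetical document and twig: b with a child d (PC) and a descendant c (AD),
# where c plays the role of the output query node.
DOC = ET.fromstring(
    '<a><b id="1"><c id="2"/></b>'
    '<b id="3"><d id="4"><c id="5"/></d><c id="7"/></b>'
    '<c id="6"/></a>'
)
QUERY = QueryNode("b", children=[QueryNode("d", axis="PC"),
                                 QueryNode("c", axis="AD")])

mappings = evaluate(QUERY, DOC)
output_ids = [m["c"].get("id") for m in mappings]  # project onto the output node
print(output_ids)  # ['5', '7']
```

Only the second `b` element has a `d` child, so both mappings bind `b` to it and differ in which descendant `c` they select. This brute-force enumeration rescans subtrees repeatedly, which is the kind of cost the join algorithms cited in the introduction are designed to avoid.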

Bibliography

[1] S. Al-Khalifa and H. Jagadish. Multi-level operator combination in XML query processing. In Proceedings of the eleventh international conference on Information and knowledge management, pages 134–141. ACM, 2002.
[2] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings 18th International Conference on Data Engineering, pages 141–152. IEEE, 2002.
[3] R. Bača, M. Krátký, I. Holubová, M. Nečaský, T. Skopal, M. Svoboda, and S. Sakr. Structural XML query processing. ACM Computing Surveys (CSUR), 50(5):64, 2017.
[4] R. Bača, M. Krátký, T. W. Ling, and J. Lu. Optimal and efficient generalized twig pattern processing: a combination of preorder and postorder filterings. The VLDB Journal, 22(3):369–393, 2013.
[5] R. Bača, P. Lukáš, and M. Krátký. Cost-based holistic twig joins. Information Systems, 52:21–33, 2015.
[6] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 310–321. ACM, 2002.
[7] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K. S. Candan. Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proceedings of the 32nd international conference on Very large data bases, pages 283–294. VLDB Endowment, 2006.
[8] Z. Chen, H. Jagadish, L. V. Lakshmanan, and S. Paparizos. From tree patterns to generalized tree patterns: On efficient evaluation of XQuery. In Proceedings of the 29th international conference on Very large data bases - Volume 29, pages 237–248. VLDB Endowment, 2003.