Rewritability in Monadic Disjunctive Datalog, MMSNP, and Expressive Description Logics

Cristina Feier; Antti Kuusisto; Carsten Lutz

arXiv:1701.02231·cs.LO·June 22, 2023

TL;DR

This paper investigates the decidability and complexity of rewriting problems for monadic disjunctive Datalog, MMSNP, and description logic-based queries, providing new constructions and complexity bounds.

Contribution

It establishes decidability and complexity results for rewritability into FO, Datalog, and MDLog, and introduces a new canonical Datalog construction applicable to formulas with free variables.

Findings

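1. Rewritability into FO and into monadic Datalog (MDLog) is decidable.

2. Rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality.

3. All studied problems are 2NExpTime-complete, except rewritability into MDLog, for which a gap between 2NExpTime and 3ExpTime remains.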

Abstract

We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the ALC family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze the shape of rewritings, which in the MMSNP case correspond to obstructions, and give a new construction of canonical Datalog programs that is more elementary than existing ones and also applies to formulas with free variables.

Equations

β_{1} \lor \dots \lor β_{n} \leftarrow α_{1} \land \dots \land α_{m}

S_{1} (x_{1}) \lor \dots \lor S_{m} (x_{m}) \leftarrow R_{1} (y_{1}) \land \dots \land R_{n} (y_{n})

\begin{array}[]{r@{\;}c@{\;}l}q(x_{1},x_{2},x_{3})&=&r(x_{1},x_{2})\wedge r(x_{2},x_{3})\wedge r(x_{3},x_{1}),\end{array}

R_{q (x)} (a, a^{'}, c^{'}), R_{q (x)} (b, b^{'}, a^{'}), R_{q (x)} (c, c^{'}, b^{'}) .

r (a, a^{'}), r (a^{'}, c^{'}), r (c^{'}, a), r (b, b^{'}), r (b^{'}, a^{'}), r (a^{'}, b), r (c, c^{'}), r (c^{'}, b), r (b^{'}, c) .

P_{0} (x_{0}) \leftarrow P_{1} (x_{1}) \land \dots \land P_{n} (x_{n}) \land q (y)

P_{0} (x_{0}) \leftarrow P_{1} (x_{1}) \land \dots \land P_{n} (x_{n}) \land q^{'} (y^{'}) .

I_{D} = {R_{q_{v} (x)} (dom (I_{v})) ∣ v \in V}

P (x) \land eq (x, y) \to P (y) and P (y) \land eq (x, y) \to P (x)

eq (b_{p}, b_{p, p^{'}, 1}), eq (b_{p, p^{'}, 1}, b_{p, p^{'}, 2}), \dots, eq (b_{p, p^{'}, g - 1}, b_{p, p^{'}, g}), eq (b_{p, p^{'}, g}, b_{p^{'}}) .

Ω := (S_{E} \times {1, \dots, m}) \cup (S_{E} \times {1, \dots, m} \times S_{E} \times {1, \dots, m} \times {1, \dots, g})

\begin{array}[]{rcl}P_{0}(\mathbf{x}_{0})&\leftarrow&P_{1}(\mathbf{x}_{1})\wedge\cdots\wedge P_{\ell_{1}}(\mathbf{x}_{\ell_{1}})\,\wedge\\[2.84526pt] &&R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})\,\wedge\\[2.84526pt] &&{\mathtt{eq}}(z_{1,1},z_{1,2})\wedge\cdots\wedge{\mathtt{eq}}(z_{\ell_{3},1},z_{\ell_{3},2})\end{array}

\begin{array}[]{rcl}P^{\mu_{0}}_{0}(\mathbf{x}^{\prime}_{0})&\leftarrow&P^{\mu_{1}}_{1}(\mathbf{x}^{\prime}_{1})\wedge\cdots\wedge P^{\mu_{\ell_{1}}}_{\ell_{1}}(\mathbf{x}^{\prime}_{\ell_{1}})\,\wedge\\[2.84526pt] &&R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})\,\wedge\\[2.84526pt] &&W\end{array}

goal () \leftarrow P (x_{1}, x_{2}) \land R (x_{1}, x_{2}) \land eq (x_{2}, x_{3})

\begin{array}[]{rcl}{\mathtt{goal}}()&\leftarrow&P^{\mu}(x_{1},x_{2},x_{1},x_{2},u_{1},u_{2})\wedge R(x_{1},x_{2})\,\wedge\\[2.84526pt] &&R(x_{1},w_{1})\wedge R(w_{2},x_{2})\wedge R(w_{3},x_{3})\wedge R(x_{3},w_{4})\end{array}

\begin{array}[]{rcl}q(\mathbf{x})&=&P_{1}(\mathbf{x}_{1})\wedge\cdots\wedge P_{\ell_{1}}(\mathbf{x}_{\ell_{1}})\,\wedge\\[2.84526pt] &&R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})\,\wedge\\[2.84526pt] &&{\mathtt{eq}}(z_{1,1},z_{1,2})\wedge\cdots\wedge{\mathtt{eq}}(z_{\ell_{3},1},z_{\ell_{3},2})\end{array}

\begin{array}[]{rcl}P^{\mu_{0}}_{0}(\mathbf{x}^{\prime}_{0})&\leftarrow&P^{\mu_{1}}_{1}(\mathbf{x}^{\prime}_{1})\wedge\cdots\wedge P^{\mu_{\ell_{1}}}_{\ell_{1}}(\mathbf{x}^{\prime}_{\ell_{1}})\,\wedge\\[2.84526pt] &&R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})\,\wedge\\[2.84526pt] &&W\end{array}

\begin{array}[]{rcl}P_{0}(\mathbf{x}_{0})&\leftarrow&P_{1}(\mathbf{x}_{1})\wedge\cdots\wedge P_{\ell_{1}}(\mathbf{x}_{\ell_{1}})\,\wedge\\[2.84526pt] &&R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})\,\wedge\\[2.84526pt] &&{\mathtt{eq}}(z_{1,1},z_{1,2})\wedge\cdots\wedge{\mathtt{eq}}(z_{\ell_{3},1},z_{\ell_{3},2})\end{array}

p_{v} (y_{v}) \lor Q_{v} (x_{v}) \leftarrow q (x) ∣_{B_{v}} \land v ≺ v^{'} ⋀ Q_{v^{'}} (x_{v^{'}})

P_{1} (x) \lor P_{2} (z) \leftarrow R (x, y_{1}) \land S (x, y_{2}) \land R (y_{1}, z) \land R (y_{2}, z)

\begin{array}[]{rcl}P_{2}(z)\vee Q_{v^{\prime}}(y_{1},y_{2})\leftarrow R(y_{1},z)\wedge R(y_{2},z)\\[2.84526pt] P_{1}(z)\leftarrow R(x,y_{1})\wedge S(x,y_{2})\wedge Q_{v^{\prime}}(y_{1},y_{2}).\end{array}

P (x_{1}, x_{2} ∣ y_{1}, y_{1}, y_{2}) \leftarrow Q (y_{1} ∣ y_{1}, y_{1}, y_{2}) \land R (x_{1}, y_{1}, y_{2}, x_{2})

\begin{array}[]{rcl}P(y\,|\,x)&\leftarrow&R(y\,|\,x)\\[1.42262pt] P(z\,|\,x)&\leftarrow&P(y\,|\,x)\wedge R(z,y)\\[1.42262pt] {\mathtt{goal}}(x)&\leftarrow&P(x\,|\,x)\end{array}

\begin{array}[]{r@{\;}c@{\;}l}P_{0}(x)\vee P_{1}(y)&\leftarrow&R(x,y)\\[1.42262pt] {\mathtt{goal}}(x)&\leftarrow&P_{0}(x)\\[1.42262pt] P_{1}(y)&\leftarrow&P_{1}(x)\wedge R(x,y)\\[1.42262pt] {\mathtt{goal}}(x)&\leftarrow&P_{1}(x).\end{array}

C, D ::= ⊤ ∣ ⊥ ∣ A ∣ \neg C ∣ C ⊓ D ∣ C ⊔ D ∣ \exists r . C ∣ \exists r^{-} . C ∣ \forall r . C ∣ \forall r^{-} . C

\begin{array}[]{r@{\;}c@{\;}l}(\neg C)^{\mathcal{I}}&=&\Delta^{\mathcal{I}}\setminus C^{\mathcal{I}}\\[1.42262pt] (C\sqcap D)^{\mathcal{I}}&=&C^{\mathcal{I}}\cap D^{\mathcal{I}}\\[1.42262pt] (C\sqcup D)^{\mathcal{I}}&=&C^{\mathcal{I}}\cup D^{\mathcal{I}}\\[1.42262pt] (\exists r.C)^{\mathcal{I}}&=&\{d\in\Delta^{\mathcal{I}}\mid\exists e\in C^{\mathcal{I}}:(d,e)\in r^{\mathcal{I}}\}\\[1.42262pt] (\exists r^{-}.C)^{\mathcal{I}}&=&\{d\in\Delta^{\mathcal{I}}\mid\exists e\in C^{\mathcal{I}}:(e,d)\in r^{\mathcal{I}}\}\\[1.42262pt] (\forall r.C)^{\mathcal{I}}&=&\{d\in\Delta^{\mathcal{I}}\mid\forall e\in\Delta^{\mathcal{I}}:(d,e)\in r^{\mathcal{I}}\Rightarrow e\in C^{\mathcal{I}}\}\\[1.42262pt] (\forall r^{-}.C)^{\mathcal{I}}&=&\{d\in\Delta^{\mathcal{I}}\mid\forall e\in\Delta^{\mathcal{I}}:(e,d)\in r^{\mathcal{I}}\Rightarrow e\in C^{\mathcal{I}}\}.\end{array}

\begin{array}[]{r@{\;}c@{\;}l@{\;\;\;}c}\mathcal{T}&=&\{\;\exists{\mathtt{hasAbn}}.{\mathtt{CTest}}\sqsubseteq{\mathtt{Smoker}}\sqcup\exists{\mathtt{hasRisk}}.{\mathtt{MTC}},&(1)\\[1.42262pt] &&\;\;\;\exists{\mathtt{hasAbn}}.{\mathtt{CTest}}\sqcap\exists{\mathtt{hasRisk}}.{\mathtt{MEN2}}\sqsubseteq\exists{\mathtt{hasRisk}}.{\mathtt{MTC}},&(2)\\ &&\;\;\;{\mathtt{PCCPatient}}\sqsubseteq\exists{\mathtt{hasRisk}}.{\mathtt{MEN2}}&(3)\\[1.42262pt] &&\;\;\;\exists{\mathtt{hasRelative}}.\exists{\mathtt{hasRisk}}.{\mathtt{MEN2}}\sqsubseteq\exists{\mathtt{hasRisk}}.{\mathtt{MEN2}}\ \}&(4)\\[1.42262pt] \mathbf{S}_{E}&=&\{\;{\mathtt{hasAbn}},{\mathtt{CTest}},{\mathtt{Smoker}},{\mathtt{hasRelative}},{\mathtt{PCCPatient}}\;\}\\[1.42262pt] q(x)&=&\;{\mathtt{hasRisk}}(x,y)\wedge{\mathtt{MTC}}(y).\end{array}

hasAbn (john, t), CTest (t), Smoker (john), hasRelative (john, anna), PCCPatient (anna),

\begin{array}[]{r@{\;}c@{\;}l}P(x_{3})&\leftarrow&A(x_{1})\wedge r(x_{1},x_{2})\wedge r(x_{2},x_{3})\wedge r(x_{3},x_{1})\\[1.42262pt] {\mathtt{goal}}()&\leftarrow&r(x_{1},x_{2})\wedge r(x_{2},x_{3})\wedge r(x_{3},x_{1})\,\wedge\\[1.42262pt] &&P(x_{1})\wedge P(x_{2})\wedge P(x_{3})\end{array}

\begin{array}[]{r@{\;}c@{\;}l}P(x_{3})&\leftarrow&R_{q_{1}}(x_{1},x_{2},x_{3})\\[1.42262pt] {\mathtt{goal}}()&\leftarrow&R_{q_{2}}(x_{1},x_{2},x_{3})\wedge P(x_{1})\wedge P(x_{2})\wedge P(x_{3})\\[1.42262pt] {\mathtt{goal}}()&\leftarrow&R_{q_{1}}(x_{1},x_{2},x_{3})\wedge P(x_{1})\wedge P(x_{2})\wedge P(x_{3})\end{array}

Full text

Logical Methods in Computer Science, Volume 15, Issue 2, Paper 15. Submitted Jun. 06, 2018; published May 23, 2019.

The current paper is the extended version of an invited conference abstract [FKL17]. It contains detailed proofs of all results and new material regarding dichotomies and deciding PTime query evaluation.

Rewritability in Monadic Disjunctive Datalog, MMSNP, and Expressive Description Logics

Cristina Feier (a), Antti Kuusisto (b), and Carsten Lutz (b)

(a) University of Bremen, Department of Computer Science, Germany

(b) Tampere University, Mathematics, Finland

Abstract.

We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the $\mathcal{ALC}$ family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. We also analyze the shape of rewritings, which in the case of MMSNP correspond to obstructions, and give a new construction of canonical Datalog programs that is more elementary than existing ones and also applies to non-Boolean queries.

Key words and phrases:

MDDLog, MMSNP, OMQ, FO-Rewritability, Monadic Datalog-Rewritability

The author was supported by ERC Consolidator Grant 647289 CODA.

1. Introduction

In data access with ontologies, the premier aim is to answer queries over incomplete and heterogeneous data while taking advantage of the domain knowledge provided by an ontology [BtCLW14, CDL+09, BO15]. Since traditional database systems are often unaware of ontologies, it is common to rewrite the emerging ontology-mediated queries (OMQs) into more standard database query languages. For example, the DL-Lite family of description logics (DLs) was designed as an ontology language specifically so that any OMQ $Q=(\mathcal{T},\Sigma,q)$, where $\mathcal{T}$ is a DL-Lite ontology, $\Sigma$ a data signature, and $q$ a conjunctive query, can be rewritten into an equivalent first-order (FO) query that can then be executed using a standard SQL database system [CGL+07, ACKZ09]. In more expressive ontology languages, it is not guaranteed that for every OMQ, there is an equivalent FO query. For example, this is the case for DLs of the $\mathcal{EL}$ and Horn-$\mathcal{ALC}$ families and for DLs of the expressive $\mathcal{ALC}$ family; see [BHLS17] for a general introduction to DLs. In many members of the $\mathcal{EL}$ and Horn-$\mathcal{ALC}$ families, however, rewritability into monadic Datalog (MDLog) is guaranteed, thus enabling the use of Datalog engines for query answering. In $\mathcal{ALC}$ and above, not even Datalog-rewritability is generally ensured. Since ontologies emerging from practical applications tend to be structurally simple, though, there is reason to hope that (FO-, MDLog-, and Datalog-) rewritings do exist in many practically relevant cases even when the ontology is formulated in an expressive language. This has in fact been experimentally confirmed for FO-rewritability in the $\mathcal{EL}$ family of DLs [HLISW15], and it has led to the implementation of rewriting tools that, although incomplete, are able to compute rewritings in many practical cases [PUMH10, KNG14, TSCS15].

Fundamental problems that emerge from this situation are to understand the exact limits of rewritability and to provide (complete) algorithms that decide the rewritability of a given OMQ and that compute a rewriting when it exists. These problems have been addressed in [BLW13, HLISW15, BHLW16, LS17] for DLs from the $\mathcal{EL}$ and Horn-$\mathcal{ALC}$ families. For DLs from the $\mathcal{ALC}$ family, first results were obtained in [BtCLW14] where a connection between OMQs and constraint satisfaction problems (CSPs) was established and then used to transfer decidability results from CSPs to OMQs. In fact, rewritability is an important topic in CSP (where it is called definability) as it constitutes a central tool for analyzing the complexity of CSPs [FV98, LLT07, ELT07, DL08]. In particular, it is known that deciding the rewritability of (the complement of) a given CSP into FO and into Datalog is NP-complete [LLT07, Bar16, CL17] and rewritability into MDLog is NP-hard and in ExpTime [CL17]. In [BtCLW14], these results were used to show that FO- and Datalog-rewritability of OMQs $(\mathcal{T},\Sigma,q)$ is decidable and in fact NExpTime-complete when $\mathcal{T}$ is formulated in $\mathcal{ALC}$ or a moderate extension thereof and $q$ is an atomic query (AQ), that is, a monadic query of the simple form $A(x)$. For MDLog-rewritability, one can show NExpTime-hardness and containment in 2ExpTime.

The aim of this paper is to study the above questions for OMQs where the ontology is formulated in an expressive DL from the $\mathcal{ALC}$ family and where the actual query is a conjunctive query (CQ) or a union of conjunctive queries (UCQ). As observed in [BtCLW14], transitioning in OMQs from AQs to UCQs corresponds to the transition from CSP to its logical generalization MMSNP introduced by Feder and Vardi [FV98] and studied, for example, in [MS07, Mad09, Mad10, BCF12]. More precisely, while the OMQ language $(\mathcal{ALC},\text{AQ})$ that consists of all OMQs $(\mathcal{T},\Sigma,q)$ where $\mathcal{T}$ is formulated in $\mathcal{ALC}$ and $q$ is an AQ has the same expressive power as the complement of CSP (with multiple templates and a single constant), the OMQ language $(\mathcal{ALC},\text{UCQ})$ has the same expressive power as the complement of MMSNP (with free variables)—which in turn is essentially a notational variant of monadic disjunctive Datalog (MDDLog). It should be noted, however, that while all these formalisms are equivalent in expressive power, they differ significantly in succinctness [BtCLW14]; in particular, the best known translation of OMQs from $(\mathcal{ALC},\text{UCQ})$ into MMSNP/MDDLog involves a double exponential blowup. In contrast to the CSP case, FO-, MDLog-, and Datalog-rewritability of (complemented) MMSNP sentences was not known to be decidable. In this paper, we establish decidability of FO- and MDLog-rewritability in $(\mathcal{ALC},\text{UCQ})$ and related OMQ languages, in MDDLog, and in complemented MMSNP. We show that FO-rewritability is 2NExpTime-complete in all three cases, and that MDLog-rewritability is in 3ExpTime; a 2NExpTime lower bound was established in [BL16]. Let us discuss our results on the complexity of FO-rewritability from three different perspectives. From the OMQ perspective, the transition from AQs to UCQs results in an increase of complexity from NExpTime to 2NExpTime. 
From the monadic Datalog perspective, adding disjunction (transitioning from monadic Datalog to MDDLog) results in a moderate increase of complexity from 2ExpTime [BtCCV15] to 2NExpTime. And from the CSP perspective, the transition from CSPs to MMSNP results in a rather dramatic complexity jump from NP to 2NExpTime.

For Datalog-rewritability, we obtain only partial results. In particular, we show that Datalog-rewritability is decidable and 2NExpTime-complete for MDDLog programs that, in a certain technical sense made precise in the paper, have equality. For the general case, we only obtain a potentially incomplete procedure. It is quite possible that the procedure is in fact complete, but proving this remains an open issue for now. These results also apply to analogously defined classes of MMSNP sentences and OMQs that have equality.

While we mainly focus on deciding whether a rewriting exists rather than actually computing it, we also analyze the shape that rewritings can take. Since the shape turns out to be rather restricted, this is important information for algorithms (complete or incomplete) that seek to compute rewritings. In the CSP/MMSNP world, this corresponds to analyzing obstruction sets for MMSNP, in the style of CSP obstructions [Nes08, BKL08, Ats08] and not to be confused with colored forbidden patterns sometimes used to characterize MMSNP [MS07]. More precisely, we show that an OMQ $(\mathcal{T},\Sigma,q)$ from $(\mathcal{ALC},\text{UCQ})$ is FO-rewritable if and only if it is rewritable into a UCQ in which each CQ has treewidth $(1,\max\{2,n_{q}\})$, where $n_{q}$ is the size of $q$ (what we mean here is that $q$ has a tree decomposition in which every bag has at most $\max\{2,n_{q}\}$ elements and in which neighboring bags overlap in at most one element); similarly, the complement of an MMSNP sentence $\varphi$ is FO-definable if and only if it admits a finite set of finite obstructions of treewidth $(1,k)$ where $k$ is the diameter of $\varphi$ (the maximum number of variables in a negated conjunction in its body, in Feder and Vardi’s terminology). We also show that an OMQ $(\mathcal{T},\Sigma,q)$ is MDLog-rewritable if and only if it is rewritable into an MDLog program of diameter $\max\{2,n_{q}\}$ where the diameter of an MDLog program is the maximum number of variables in a rule; equivalently, the complement of an MMSNP sentence $\varphi$ is MDLog-definable if and only if it admits a (potentially infinite) set of finite obstructions of treewidth $(1,k)$ where $k$ is the diameter of $\varphi$. For the case of rewriting into Datalog, we give a new and direct construction of canonical Datalog-rewritings of MMSNP sentences. It has been observed in [FV98] that for every CSP and all $\ell,k$, it is possible to construct a canonical Datalog program $\Pi$ of width $\ell$ and diameter $k$ (the width is the maximum arity of IDB relations) in the sense that if any such program is a rewriting of the CSP, then so is $\Pi$; moreover, even when there is no $(\ell,k)$-Datalog rewriting, $\Pi$ is the best possible approximation of such a rewriting. The existence of canonical Datalog-rewritings for (complemented) MMSNP sentences was already known from [BD13]. However, the construction given there is quite complex, proceeding via an infinite template that is obtained by applying an intricate construction due to Cherlin, Shelah, and Shi [CSS99], and resulting in canonical programs that are rather difficult to understand and to analyze. In contrast, our construction is elementary and essentially parallels the CSP case; it also applies to MMSNP formulas with free variables, where the canonical program takes a rather special form that involves parameters, similar in spirit to the parameters to least fixed-point operators in FO(LFP) [BBV16].

Our main technical tool is the translation of MMSNP sentences into CSPs exhibited by Feder and Vardi [FV98]; actually, the target of the translation is a generalized CSP, meaning that there are multiple templates. The translation is not equivalence preserving and involves a double exponential blowup, but it was designed so as to preserve complexity up to polynomial time reductions. Here, we are primarily interested in the semantic relationship between the original MMSNP sentence and the constructed CSP. It turns out that the translation does not quite preserve rewritability. In particular, when the original MMSNP sentence has a rewriting, then the natural way of constructing from it a rewriting for the CSP is sound only on instances of high girth. However, FO- and MDLog-rewritings of CSPs that are sound on high girth (and unconditionally complete) can be converted into rewritings that are unconditionally sound (and complete). The same is true for Datalog-rewritings when the CSP is derived from an MMSNP sentence that has equality, but it remains open whether it is true for Datalog-rewritings of unrestricted CSPs.

With our translations in place, we can also make relevant observations regarding the (data) complexity of query evaluation in MMSNP, in MDDLog, and of OMQs. This is especially interesting in the light of the recently obtained breakthrough in CSPs that there is a dichotomy between PTime and NP in the complexity of CSPs [Bul17, Zhu17] and that it is decidable and NP-complete whether a CSP is in PTime [CL17]. We show that this implies a dichotomy between PTime and coNP for MDDLog and for all OMQ languages mentioned above. We also prove that PTime query evaluation is decidable and 2NExpTime-complete in the mentioned query languages, and that the same is true for MMSNP.

The structure of this paper is as follows. In Section 2, we introduce the essentials of disjunctive Datalog and its relevant fragments as well as CSP and MMSNP; in fact, we shall always work with Boolean MDDLog rather than with complemented MMSNP. In Section 3, we summarize the main properties of Feder and Vardi’s translation of MMSNP into CSP. We use this in Section 4 to show that FO- and MDLog-rewritability of Boolean MDDLog programs and of the complements of MMSNP sentences are decidable, also establishing the announced complexity results. In Section 5, we analyze the shape of FO- and MDLog-rewritings and of obstructions for MMSNP sentences. We also establish an MMSNP analogue of an essential combinatorial lemma for CSPs which says that it is possible to replace a structure by a structure of high girth while preserving certain homomorphisms; the MMSNP analogue achieves high ‘decomposition girth’ (defined in the paper) and preserves the truth of certain MMSNP sentences. In Section 6, we study Datalog-rewritability of MDDLog programs that have equality and construct canonical Datalog programs. Section 7 is concerned with lifting our results from the Boolean case to the general case, concerning the complexity of deciding rewritability, the shape of rewritings, and the construction of canonical Datalog programs. In this section, Datalog programs with parameters play a central role. In Section 8, we introduce OMQs and further lift our results to that setting, finally arriving at our goal to study fundamental rewritability questions for OMQ languages based on (unions of) conjunctive queries. Section 9 is then concerned with dichotomies and the complexity of deciding PTime query evaluation. We conclude in Section 10.

2. Preliminaries

We introduce disjunctive Datalog, CSP, and MMSNP. To avoid overloading this section, the introduction of ontology languages and ontology-mediated queries is deferred to Section 8.

A schema is a finite collection $\mathbf{S}=(S_{1},\dots,S_{k})$ of relation symbols with associated arity. An $\mathbf{S}$-fact is an expression of the form $S(a_{1},\ldots,a_{n})$ where $S\in\mathbf{S}$ is an $n$-ary relation symbol, and $a_{1},\ldots,a_{n}$ are elements of some fixed, countably infinite set ${\mathtt{const}}$ of constants. For an $n$-ary relation symbol $S$, ${\mathtt{pos}}(S)$ is $\{1,\ldots,n\}$. An $\mathbf{S}$-instance $I$ is a finite set of $\mathbf{S}$-facts. The active domain $\mathsf{dom}(I)$ of $I$ is the set of all constants that occur in the facts in $I$. We use the symbols $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$ to denote tuples of constants and, slightly abusing notation, write $\mathbf{a}\subseteq\mathsf{dom}(I)$ to mean that $\mathbf{a}$ is a tuple of constants from $\mathsf{dom}(I)$ when we do not want to be precise about the length of the tuple. For an instance $I$ and a schema $\mathbf{S}$, we write $I|_{\mathbf{S}}$ to denote the restriction of $I$ to the relation symbols in $\mathbf{S}$.
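To make these notions concrete, the following minimal Python sketch represents a fact as a pair of a relation symbol and a tuple of constants, and an instance as a finite set of such pairs. The encoding and the function names `active_domain` and `restrict` are assumptions made for illustration and do not come from the paper.

```python
def active_domain(instance):
    """dom(I): all constants that occur in the facts of the instance."""
    return {c for _, args in instance for c in args}


def restrict(instance, schema):
    """I|_S: keep only the facts whose relation symbol belongs to the schema."""
    return {(rel, args) for rel, args in instance if rel in schema}


# Hypothetical example instance over the schema {r, A}.
I = {("r", ("a", "b")), ("r", ("b", "c")), ("A", ("a",))}
```

Here `active_domain(I)` collects the constants `a`, `b`, `c`, and `restrict(I, {"A"})` keeps only the unary fact.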

A tree decomposition of an instance $I$ is a pair $(T,(B_{v})_{v\in V})$, where $T=(V,E)$ is an undirected tree and $(B_{v})_{v\in V}$ is a family of subsets of $\mathsf{dom}(I)$ such that the following conditions are satisfied:

(1) for all $a\in\mathsf{dom}(I)$, $\{v\in V\mid a\in B_{v}\}$ is nonempty and connected in $T$;

(2) for every fact $R(a_{1},\ldots,a_{r})$ in $I$, there is a $v\in V$ such that $a_{1},\ldots,a_{r}\in B_{v}$.

Unlike in the traditional setup [FG06], we are interested in two parameters of tree decompositions instead of only one. We call $(T,(B_{v})_{v\in V})$ an $(\ell,k)$-tree decomposition if for all distinct $v,v^{\prime}\in V$, $|B_{v}\cap B_{v^{\prime}}|\leq\ell$ and $|B_{v}|\leq k$. An instance $I$ has treewidth $(\ell,k)$ if it admits an $(\ell,k)$-tree decomposition.
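The definition can be read as a checklist. The sketch below, which is only an illustration and not part of the paper, verifies whether a candidate pair of a tree and a family of bags is an $(\ell,k)$-tree decomposition of a given instance. Facts are assumed to be encoded as pairs of a relation symbol and a tuple of constants; the names `connected` and `is_lk_tree_decomposition` are hypothetical.

```python
from itertools import combinations


def connected(subset, edges):
    """True if `subset` is nonempty and induces a connected subgraph."""
    subset = set(subset)
    if not subset:
        return False
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == v and y in subset and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == subset


def is_lk_tree_decomposition(instance, nodes, edges, bags, l, k):
    """Check conditions (1) and (2) and the size bounds l and k."""
    # T must be an undirected tree: connected with exactly |V| - 1 edges.
    edge_set = {frozenset(e) for e in edges}
    if any(len(e) != 2 for e in edge_set) or len(edge_set) != len(nodes) - 1:
        return False
    if not connected(nodes, edges):
        return False
    dom = {c for _, args in instance for c in args}
    # Every bag must be a subset of dom(I).
    if any(not set(bags[v]) <= dom for v in nodes):
        return False
    # (1) each constant occurs in a nonempty, connected set of bags.
    if not all(connected([v for v in nodes if a in bags[v]], edges) for a in dom):
        return False
    # (2) each fact is covered by some bag.
    if not all(any(set(args) <= set(bags[v]) for v in nodes)
               for _, args in instance):
        return False
    # Bag size at most k, pairwise overlap of distinct bags at most l.
    return (all(len(bags[v]) <= k for v in nodes)
            and all(len(set(bags[u]) & set(bags[v])) <= l
                    for u, v in combinations(nodes, 2)))
```

For instance, the path consisting of `r(a,b)` and `r(b,c)` with bags `{a,b}` and `{b,c}` passes the check for $(\ell,k)=(1,2)$ but not for $(0,2)$, since the two bags share the constant `b`.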

We now define the notion of girth. A finite structure $I$ has a cycle of length $n>0$ if there are distinct facts $R_{0}(\mathbf{a}_{0}),\dots,R_{n-1}(\mathbf{a}_{n-1})\in I$ and positions $p_{0},p^{\prime}_{0}\in{\mathtt{pos}}(R_{0}),\dots,p_{n-1},p^{\prime}_{n-1}\in{\mathtt{pos}}(R_{n-1})$ such that

• $p_{i}\neq p^{\prime}_{i}$ for $0\leq i<n$ and

• the constant in the $p^{\prime}_{i}$-th position of $\mathbf{a}_{i}$ is identical to the constant in the $p_{i+1}$-th position of $\mathbf{a}_{i+1}$ for $0\leq i<n$, where $\mathbf{a}_{n}:=\mathbf{a}_{0}$, $p_{n}:=p_{0}$, and $p^{\prime}_{n}:=p^{\prime}_{0}$.

The girth of $I$ is the length of the shortest cycle in it and $\omega$ if $I$ has no cycle (in which case we say that $I$ is a tree).
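For small instances, the definition of cycles can be checked by brute force. The following sketch enumerates sequences of distinct facts and propagates the positions at which each fact can be entered. The names `has_cycle` and `girth` are assumptions, positions are 0-based, `None` stands in for $\omega$, and the running time is factorial, so this is meant only as an executable reading of the definition.

```python
from itertools import permutations


def has_cycle(facts, n):
    """True iff the instance has a cycle of length n in the sense defined above."""
    facts = list(set(facts))
    for seq in permutations(facts, n):
        for p0 in range(len(seq[0][1])):
            # `entry` collects the possible positions p_i at which fact i is entered.
            entry = {p0}
            for i in range(n):
                args = seq[i][1]
                nxt = seq[(i + 1) % n][1]
                # Leave fact i at a position x different from the entry position
                # and enter the next fact at a position holding the same constant.
                entry = {
                    p
                    for e in entry
                    for x in range(len(args))
                    if x != e
                    for p in range(len(nxt))
                    if nxt[p] == args[x]
                }
                if not entry:
                    break
            # The cycle closes if we can re-enter the first fact at p0.
            if p0 in entry:
                return True
    return False


def girth(facts):
    """Length of a shortest cycle, or None if the instance is a tree."""
    for n in range(1, len(set(facts)) + 1):
        if has_cycle(facts, n):
            return n
    return None
```

Note that under this definition a single fact such as $r(a,a)$ already forms a cycle of length one, and two distinct facts over the same pair of constants form a cycle of length two.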

A constraint satisfaction problem (CSP) is defined by an instance $T$ over a schema $\mathbf{S}_{E}$, called template. (Adopting Datalog terminology, we generally call the schema in which the data is formulated the EDB schema and denote it with $\mathbf{S}_{E}$.) The problem associated with $T$, denoted $\text{CSP}(T)$, is to decide whether an input instance $I$ over $\mathbf{S}_{E}$ admits a homomorphism to $T$, denoted $I\rightarrow T$. We use coCSP$(T)$ to denote the complement problem, that is, deciding whether $I\not\rightarrow T$. A generalized CSP is defined by a set of templates $S$ over the same schema $\mathbf{S}_{E}$ and asks for a homomorphism from the input $I$ to at least one of the templates $T\in S$, denoted $I\rightarrow S$. Note that a (generalized) CSP can be viewed as a Boolean query over $\mathbf{S}_{E}$-instances.
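The homomorphism problems just introduced can be decided naively by enumerating all maps between active domains. The sketch below is a brute-force illustration under assumed names (`homomorphisms`, `maps_to`, `in_generalized_csp`); it is exponential and not an algorithm used in the paper.

```python
from itertools import product


def homomorphisms(source, target):
    """Yield all homomorphisms (as dicts) from instance `source` to instance `target`."""
    source = list(source)
    target = set(target)
    src_dom = sorted({c for _, args in source for c in args})
    tgt_dom = sorted({c for _, args in target for c in args})
    for image in product(tgt_dom, repeat=len(src_dom)):
        h = dict(zip(src_dom, image))
        if all((rel, tuple(h[c] for c in args)) in target for rel, args in source):
            yield h


def maps_to(source, target):
    """I -> T: is there a homomorphism from I to T?"""
    return next(homomorphisms(source, target), None) is not None


def in_generalized_csp(instance, templates):
    """I -> S: I maps homomorphically to at least one template in S."""
    return any(maps_to(instance, t) for t in templates)


# Template for 2-colourability of directed graphs (a symmetric edge on two nodes).
k2 = {("r", ("0", "1")), ("r", ("1", "0"))}
directed_triangle = {("r", ("a", "b")), ("r", ("b", "c")), ("r", ("c", "a"))}
directed_square = {("r", ("a", "b")), ("r", ("b", "c")), ("r", ("c", "d")), ("r", ("d", "a"))}
```

Here `maps_to(directed_square, k2)` holds while `maps_to(directed_triangle, k2)` does not, so the triangle belongs to coCSP of the template `k2`.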

An MMSNP sentence $\theta$ over schema $\mathbf{S}_{E}$ has the form $\exists X_{1}\cdots\exists X_{n}\forall x_{1}\cdots\forall x_{m}\varphi$ with $X_{1},\dots,X_{n}$ monadic second-order variables, $x_{1},\dots,x_{m}$ first-order variables, and $\varphi$ a conjunction of quantifier-free formulas of the form

$$\beta_{1}\vee\cdots\vee\beta_{n}\leftarrow\alpha_{1}\wedge\cdots\wedge\alpha_{m}$$

with $n,m\geq 0$ and where each $\alpha_{i}$ takes the form $X_{i}(x_{j})$ or $R(\mathbf{x})$ with $R\in\mathbf{S}_{E}$ and each $\beta_{i}$ takes the form $X_{i}(x_{j})$ . The diameter of $\theta$ is the maximum number of variables in some implication in $\varphi$ . This presentation is syntactically different from, but semantically equivalent to the original definition from [FV98], which does not use the implication symbol and instead restricts the allowed polarities of atoms. Both forms can be interconverted in polynomial time, see [BtCLW14]. The semantics of MMSNP is the standard semantics of second-order logic. More information can be found, e.g., in [BCF12, BD13]. Note that, just like CSPs, MMSNP sentences can be viewed as Boolean queries.

A conjunctive query (CQ) takes the form $q(\mathbf{x})=\exists\mathbf{y}\,\varphi(\mathbf{x},\mathbf{y})$ where $\varphi$ is a conjunction of relational atoms and $\mathbf{x}$ , $\mathbf{y}$ denote tuples of variables; the equality relation may be used. Whenever convenient, we shall confuse $q(\mathbf{x})$ with the set of atoms in $\varphi$ . The variables in $\mathbf{x}$ are the answer variables in $q(\mathbf{x})$ . The arity of $q(\mathbf{x})$ is the number of its answer variables and $q(\mathbf{x})$ is Boolean if it has arity zero. We say that $q(\mathbf{x})$ is over $\mathbf{S}_{E}$ if $\varphi$ uses only relation symbols from $\mathbf{S}_{E}$ . An answer to $q$ on an $\mathbf{S}_{E}$ -instance $I$ is a tuple of constants $\mathbf{a}$ such that $I\models q(\mathbf{a})$ in the standard sense of first-order logic. It is well-known that $I\models q(\mathbf{a})$ if and only if there is a homomorphism from $\varphi$ viewed as an instance to $I$ that takes $\mathbf{x}$ to $\mathbf{a}$ . A Boolean CQ $q$ is true on an instance $I$ , denoted $I\models q$ , if the empty tuple is an answer to $q$ on $I$ . A CQ $q$ is a contraction of a CQ $q^{\prime}$ if it can be obtained from $q^{\prime}$ by identifying variables. A union of conjunctive queries (UCQ) is a disjunction of CQs with the same answer variables. The semantics of UCQs is defined in the expected way.

A disjunctive Datalog rule $\rho$ has the form

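$$S_{1}(\mathbf{x}_{1})\vee\cdots\vee S_{m}(\mathbf{x}_{m})\leftarrow R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{n}(\mathbf{y}_{n})$$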

with $n>0$ and $m\geq 0$. We refer to $S_{1}(\mathbf{x}_{1})\vee\cdots\vee S_{m}(\mathbf{x}_{m})$ as the head of $\rho$, and to $R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{n}(\mathbf{y}_{n})$ as the body. Every variable that occurs in the head of $\rho$ is required to also occur in the body of $\rho$. A disjunctive Datalog (DDLog) program $\Pi$ is a finite set of disjunctive Datalog rules with a selected goal relation ${\mathtt{goal}}$ that does not occur in rule bodies and appears only in non-disjunctive goal rules ${\mathtt{goal}}(\mathbf{x})\leftarrow R_{1}(\mathbf{x}_{1})\wedge\cdots\wedge R_{n}(\mathbf{x}_{n})$. The arity of $\Pi$ is the arity of the ${\mathtt{goal}}$ relation; we say that $\Pi$ is Boolean if it has arity zero. Relation symbols that occur in the head of at least one rule of $\Pi$ are intensional (IDB) relations, and all remaining relation symbols in $\Pi$ are extensional (EDB) relations. Note that, by definition, ${\mathtt{goal}}$ is an IDB relation. When all EDB relations in $\Pi$ are from schema $\mathbf{S}_{E}$, then we say that $\Pi$ is over EDB schema $\mathbf{S}_{E}$. The IDB schema of $\Pi$ is the set of all IDB relations in $\Pi$.

We will sometimes use body atoms of the form ${\mathtt{true}}(x)$ that are vacuously true for all elements of the active domain. This is just syntactic sugar since any rule with body atom ${\mathtt{true}}(x)$ can equivalently be replaced by a set of rules obtained by replacing ${\mathtt{true}}(x)$ in all possible ways with an atom $R(x_{1},\dots,x_{n})$ where $R$ is a relation symbol from $\mathbf{S}_{E}$ and where $x_{i}=x$ for some $i$ and all other $x_{i}$ are fresh variables.

A DDLog program is called monadic or an MDDLog program if all its IDB relations with the possible exception of ${\mathtt{goal}}$ have arity at most one. The size of a DDLog program $\Pi$ is the number of symbols needed to write it (where relation symbols and variable names count one), its width is the maximum arity of non- ${\mathtt{goal}}$ IDB relations used in it, and its diameter is the maximum number of variables that occur in a rule in $\Pi$ .

A Datalog rule is a disjunctive Datalog rule in which the rule head contains exactly one disjunct. Datalog (DLog) programs and monadic Datalog (MDLog) programs are then defined in the expected way. We call a Datalog program an $(\ell,k)$ -Datalog program if its width is $\ell$ and its diameter is $k$ .

For $\Pi$ an $n$ -ary DDLog program over schema $\mathbf{S}_{E}$ , an $\mathbf{S}_{E}$ -instance $I$ , and $a_{1},\ldots,a_{n}\in\mathsf{dom}(I)$ , we write $I\models\Pi(a_{1},\ldots,a_{n})$ if $\Pi\cup I\models{\mathtt{goal}}(a_{1},\ldots,a_{n})$ where the variables in all rules of $\Pi$ are universally quantified and thus $\Pi$ is a set of first-order (FO) sentences; please see [AHV95, EGM97] for more information on the semantics of (disjunctive) Datalog. A query $q(\mathbf{x})$ over $\mathbf{S}_{E}$ of arity $n$ is

• sound for $\Pi$ if for all $\mathbf{S}_{E}$-instances $I$ and $\mathbf{a}\subseteq\mathsf{dom}(I)$, $I\models q(\mathbf{a})$ implies $I\models\Pi(\mathbf{a})$;

• complete for $\Pi$ if for all $\mathbf{S}_{E}$-instances $I$ and $\mathbf{a}\subseteq\mathsf{dom}(I)$, $I\models\Pi(\mathbf{a})$ implies $I\models q(\mathbf{a})$;

• a rewriting of $\Pi$ if it is sound for $\Pi$ and complete for $\Pi$.

Note that Boolean programs are also covered by the above definitions. To additionally specify the syntactic shape of $q(\mathbf{x})$, we speak of a UCQ-rewriting, an MDLog-rewriting, and so on. An FO-rewriting takes the form of an FO-query that uses only relation symbols from the relevant EDB schema and possibly equality, but neither constants nor function symbols. We say that an MDDLog program $\Pi$ is FO-rewritable if there is an FO-rewriting of $\Pi$, and likewise for UCQ-rewritability and for MDLog-rewritability. Since a generalized CSP defined by a set of templates $S$ can be viewed as a Boolean query, we can also speak of a query being sound for, complete for, or a rewriting of coCSP$(S)$. The definitions are as expected, paralleling the ones above.

It was shown in [BtCLW14] that the complement of an MMSNP sentence can be translated into an equivalent Boolean MDDLog program in polynomial time and vice versa; moreover, the transformations preserve diameter and all other parameters relevant for this paper. From now on, we will thus not explicitly distinguish between Boolean MDDLog and (complemented) MMSNP.
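The correspondence with MMSNP suggests a naive evaluation procedure for Boolean MDDLog: $\Pi\cup I\models{\mathtt{goal}}()$ holds exactly when no extension of $I$ by IDB facts over $\mathsf{dom}(I)$ satisfies all rules while leaving ${\mathtt{goal}}()$ false. The sketch below implements this reading by exhaustive search. It is a hedged illustration: the rule encoding, the names `ground` and `entails_goal`, and the restriction to unary IDB relations (no nullary IDBs other than the goal) are assumptions made for brevity.

```python
from itertools import product


def ground(atom, h):
    """Apply the variable assignment h to an atom (relation, tuple_of_variables)."""
    return (atom[0], tuple(h[v] for v in atom[1]))


def violated(rule, facts, domain):
    """True iff some assignment satisfies the rule body but no head disjunct."""
    head, body = rule
    variables = sorted({v for _, args in body for v in args})
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        if all(ground(a, h) in facts for a in body) and not any(
            ground(a, h) in facts for a in head
        ):
            return True
    return False


def entails_goal(program, idb_names, instance):
    """Brute-force check whether the Boolean MDDLog program is true on the instance.

    program   : list of rules (head, body); a goal rule has head [("goal", ())]
    idb_names : the unary IDB relations of the program
    instance  : set of EDB facts (relation, tuple_of_constants)
    """
    domain = sorted({c for _, args in instance for c in args})
    candidates = [(p, a) for p in idb_names for a in domain]
    for bits in product([False, True], repeat=len(candidates)):
        facts = set(instance) | {
            (p, (a,)) for (p, a), bit in zip(candidates, bits) if bit
        }
        # goal() is never added, so a goal rule with a satisfied body counts as violated.
        if not any(violated(rule, facts, domain) for rule in program):
            return False  # a model of the program and the instance without goal()
    return True


# Non-2-colourability; true(x) is expanded into the two positions of r.
not_two_colourable = [
    ([("P1", ("x",)), ("P2", ("x",))], [("r", ("x", "y"))]),
    ([("P1", ("x",)), ("P2", ("x",))], [("r", ("y", "x"))]),
    ([("goal", ())], [("r", ("x", "y")), ("P1", ("x",)), ("P1", ("y",))]),
    ([("goal", ())], [("r", ("x", "y")), ("P2", ("x",)), ("P2", ("y",))]),
]
triangle = {("r", ("a", "b")), ("r", ("b", "c")), ("r", ("c", "a"))}
short_path = {("r", ("a", "b")), ("r", ("b", "c"))}
```

The search ranges over $2^{|\mathbf{S}_{I}|\cdot|\mathsf{dom}(I)|}$ candidate extensions, which reflects the existential second-order quantifiers of the corresponding MMSNP sentence.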

3. MDDLog, Simple MDDLog and CSP

Feder and Vardi show how to translate an MMSNP sentence into (the complement of) a generalized CSP that has the same complexity up to polynomial time reductions [FV98]. The resulting CSP has a different schema than the original MMSNP sentence and is thus not equivalent to it. We are going to make use of this translation to reduce rewritability problems for MDDLog to the corresponding problems for CSPs. Consequently, our main interest is in the precise semantic relationship between the MMSNP sentence and the constructed CSP, rather than in their complexity. In this section, we sum up the properties of the results obtained in [FV98] that are relevant for us. These properties are all we need in later sections, that is, we do not need to go further into the details of the translation. For the reader's convenience and information, we still describe the translations in full detail in Appendix A; these are based on the presentation given in [BL16], which is more detailed than the original presentation in [FV98].

Let $\mathbf{S}_{E}$ be a schema. A schema $\mathbf{S}^{\prime}_{E}$ is a $k$ -aggregation schema for $\mathbf{S}_{E}$ if its relations have the form $R_{q(\mathbf{x})}$ where $q(\mathbf{x})$ is a CQ over $\mathbf{S}_{E}$ without quantified variables and the arity of $R_{q(\mathbf{x})}$ is identical to the length of $\mathbf{x}$ , which is at most $k$ . The generalized CSP to be constructed later makes use of a schema of this form. What is important at this point is that there are natural translations of instances between the two schemas. To make this precise, let $I$ be an $\mathbf{S}_{E}$ -instance. The corresponding $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ consists of all facts $R_{q(\mathbf{x})}(\mathbf{a})$ such that $I\models q(\mathbf{a})$ . Conversely, let $I^{\prime}$ be an $\mathbf{S}^{\prime}_{E}$ -instance. The corresponding $\mathbf{S}_{E}$ -instance $I$ consists of all facts $S(\mathbf{b})$ such that $R_{q(\mathbf{x})}(\mathbf{a})\in I^{\prime}$ and $S(\mathbf{b})$ is a conjunct of $q(\mathbf{a})$ .

Example.

Let $\mathbf{S}_{E}=\{r\}$ with $r$ a binary relation symbol,

[TABLE]

and let $\mathbf{S}^{\prime}_{E}$ consist of $R_{q(\mathbf{x})}$ where $\mathbf{x}=(x_{1},x_{2},x_{3})$ for brevity. Take the $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ defined by

[TABLE]

The corresponding $\mathbf{S}_{E}$ -instance $I$ is

[TABLE]

Note that when we transition from $\mathbf{S}_{E}$ back to $\mathbf{S}^{\prime}_{E}$ and take the $\mathbf{S}^{\prime}_{E}$-instance $I^{\prime\prime}$ corresponding to $I$, we do not obtain $I^{\prime}$, but rather a strict superset that contains additional facts such as $R_{q(\mathbf{x})}(c^{\prime},b^{\prime},a^{\prime})$. This is illustrated in Figure 1 which shows the instances $I$, $I^{\prime}$, and a subset of $I^{\prime\prime}$ that contains all facts from $I^{\prime}$ plus one additional fact.
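The two translations between $\mathbf{S}_{E}$-instances and $\mathbf{S}^{\prime}_{E}$-instances can be sketched in a few lines. In the code below, each aggregated relation $R_{q(\mathbf{x})}$ is described by an entry of a dictionary `definitions` mapping its name to the pair of $\mathbf{x}$ and the atoms of $q$. This encoding, the function names, and the triangle query used for illustration are assumptions and are not taken from the example above.

```python
from itertools import product


def to_base_instance(agg_instance, definitions):
    """From an aggregated instance to the base schema: each fact R_q(a) yields the conjuncts of q(a)."""
    base = set()
    for name, consts in agg_instance:
        variables, atoms = definitions[name]
        h = dict(zip(variables, consts))
        base |= {(rel, tuple(h[v] for v in args)) for rel, args in atoms}
    return base


def to_aggregated_instance(base_instance, definitions):
    """From a base instance to the aggregation schema: all facts R_q(a) such that q(a) holds."""
    base_instance = set(base_instance)
    domain = sorted({c for _, args in base_instance for c in args})
    agg = set()
    for name, (variables, atoms) in definitions.items():
        for consts in product(domain, repeat=len(variables)):
            h = dict(zip(variables, consts))
            if all(
                (rel, tuple(h[v] for v in args)) in base_instance
                for rel, args in atoms
            ):
                agg.add((name, consts))
    return agg


# Hypothetical aggregated relation for q(x1,x2,x3) = r(x1,x2), r(x2,x3), r(x3,x1).
definitions = {
    "R_q": (
        ("x1", "x2", "x3"),
        [("r", ("x1", "x2")), ("r", ("x2", "x3")), ("r", ("x3", "x1"))],
    )
}
agg_start = {("R_q", ("a", "b", "c"))}
base = to_base_instance(agg_start, definitions)
agg_back = to_aggregated_instance(base, definitions)
```

In this small case, going to the base schema and back also produces a strict superset of the starting instance, since the rotations $R_{q}(b,c,a)$ and $R_{q}(c,a,b)$ are added.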

The translation in [FV98] consists of two steps. We describe them here using Boolean MDDLog instead of (complemented) MMSNP. The first step is to transform the given Boolean MDDLog program $\Pi$ into a Boolean MDDLog program $\Pi_{S}$ over a suitable aggregation schema $\mathbf{S}^{\prime}_{E}$ such that $\Pi_{S}$ is of a restricted syntactic form called simple. In the second step, one transforms $\Pi_{S}$ into a generalized CSP whose complement is equivalent to $\Pi_{S}$ .

We start with summing up the important aspects of the first step. A Boolean MDDLog program $\Pi_{S}$ is simple if it satisfies the following conditions:

(1) every rule in $\Pi_{S}$ contains at most one EDB atom and this atom contains all variables of the rule body, each variable exactly once;

(2) rules without an EDB atom contain at most a single variable.

Now, the first step achieves the following.

Theorem 1 ([FV98]).

Given a Boolean MDDLog program $\Pi$ over EDB schema $\mathbf{S}_{E}$ of diameter $k$ , one can construct a simple Boolean MDDLog program $\Pi_{S}$ over a $k$ -aggregation EDB schema $\mathbf{S}^{\prime}_{E}$ for $\mathbf{S}_{E}$ and IDB schema $\mathbf{S}^{\prime}_{I}$ such that

(1) If $I$ is an $\mathbf{S}_{E}$-instance and $I^{\prime}$ the corresponding $\mathbf{S}^{\prime}_{E}$-instance, then $I\models\Pi$ iff $I^{\prime}\models\Pi_{S}$;

(2) If $I^{\prime}$ is an $\mathbf{S}^{\prime}_{E}$-instance and $I$ the corresponding $\mathbf{S}_{E}$-instance, then

(a) $I^{\prime}\models\Pi_{S}$ implies $I\models\Pi$;

(b) $I\models\Pi$ implies $I^{\prime}\models\Pi_{S}$ if the girth of $I^{\prime}$ exceeds $k$.

If $\Pi$ is of size $n$, then the size of $\Pi_{S}$ and the cardinality of $\mathbf{S}^{\prime}_{E}\cup\mathbf{S}^{\prime}_{I}$ are bounded by $2^{p(k\cdot\log n)}$, where $p$ is a polynomial. The construction takes time polynomial in the size of $\Pi_{S}$.

The translation underlying Theorem 1 consists of three steps itself: first saturate $\Pi$ by adding all rules that can be obtained as a contraction of a rule in $\Pi$ , that is, by identifying variables in the rule body and head in a consistent way. Then rewrite $\Pi$ in an equivalence-preserving way so that all rule bodies are biconnected, introducing fresh unary and nullary IDB relations as needed. And finally replace the conjunction $q(\mathbf{x})$ of all EDB atoms in each rule body with a single EDB atom $R_{q(\mathbf{x})}(\mathbf{x})$ , additionally taking care of interactions between the new EDB relations that arise e.g. when we have two relations $R_{q(\mathbf{x})}$ and $R_{p(\mathbf{x})}$ such that $q(\mathbf{x})$ is contained in $p(\mathbf{x})$ in the sense of query containment. Details are in Appendix A.1.

The following theorem summarizes the second step of the translation of Boolean MDDLog into a generalized CSP.

Theorem 2 ([FV98]).

Let $\Pi$ be a simple Boolean MDDLog program over EDB schema $\mathbf{S}_{E}$ and with IDB schema $\mathbf{S}_{I}$ , $m$ the maximum arity of relations in $\mathbf{S}_{E}$ . Then there exists a set of templates $S_{\Pi}$ over $\mathbf{S}_{E}$ such that

(1) $\Pi$ is equivalent to coCSP$(S_{\Pi})$;

(2) $|S_{\Pi}|\leq 2^{|\mathbf{S}_{I}|}$ and $|T|\leq|\mathbf{S}_{E}|\cdot 2^{m|\mathbf{S}_{I}|}$ for each $T\in S_{\Pi}$.

The construction takes time polynomial in $\sum_{T\in S_{\Pi}}|T|$ .

We again sketch the idea underlying the proof of the theorem. The desired set of templates $S_{\Pi}$ contains one template for every 0-type, that is, for every set of nullary IDB relations in $\Pi$ that does not contain ${\mathtt{goal}}()$ and that satisfies all rules in $\Pi$ which use only nullary IDBs. Each template contains one constant $c_{M}$ for every 1-type $M$, that is, for every set $M$ of unary IDBs that agrees on nullary IDBs with the 0-type for which the template was constructed and that satisfies all rules in $\Pi$ which use only IDB relations that are at most unary. One then interprets all EDB relations in a maximal way so that all rules in $\Pi$ are satisfied. The fact that $\Pi$ is simple implies that no choices arise, that is, there is only one maximal interpretation of each EDB relation and the interpretations of different such relations do not interact. Details are given in Appendix A.2.

4. FO- and MDLog-Rewritability of Boolean MDDLog Programs

We exploit the translations described in the previous section and the known results that FO-rewritability of CSPs and MDLog-rewritability of coCSPs are decidable to obtain analogous results for Boolean MDDLog, and thus also for MMSNP. In the case of FO-rewritability, we obtain tight 2NExpTime complexity bounds. For MDLog-rewritability, the exact complexity remains open (as in the CSP case), between 2NExpTime and 3ExpTime.

We start with observing that FO-rewritability and MDLog-rewritability are more closely related than one might think at first glance. Recall that, by Rossman’s homomorphism preservation theorem [Ros08], a first-order formula is preserved under homomorphisms on finite structures if and only if it is equivalent to a UCQ. While every MDLog-rewriting can be viewed as an infinitary UCQ-rewriting, Rossman’s result implies that FO-rewritability of a Boolean MDDLog program coincides with (finitary) UCQ-rewritability. The latter is true also in the non-Boolean case.

Proposition 1.

Let $\Pi$ be an MDDLog program. Then $\Pi$ is FO-rewritable iff $\Pi$ is UCQ-rewritable.

Proof 4.1.

It is well known and easy to show that truth of disjunctive Datalog programs is preserved under homomorphisms. Thus, the proposition immediately follows from Rossman’s theorem in the Boolean case. For the non-Boolean case, we observe that Rossman establishes his result also in the presence of constants. Let $\Pi$ be an MDDLog program and $\varphi(\mathbf{x})$ a rewriting of $\Pi$ . We can apply Rossman’s result to $\varphi(\mathbf{a})$ , where $\mathbf{a}$ is a tuple of constants of the same length as $\mathbf{x}$ , obtaining a UCQ $q(\mathbf{a})$ equivalent to $\varphi(\mathbf{a})$ . Let $q(\mathbf{x})$ be obtained from $q(\mathbf{a})$ by replacing the constants in $\mathbf{a}$ with the variables from $\mathbf{x}$ . It can be verified that $q(\mathbf{x})$ is a rewriting of $\Pi$ .

For utilizing the translation of Boolean MDDLog programs to generalized CSPs in the intended way, the interesting aspect is to deal with the translation of a Boolean MDDLog program $\Pi$ into a simple program $\Pi_{S}$ stated in Theorem 1, since it is not equivalence preserving. The following lemma relates rewritings of $\Pi$ to rewritings of $\Pi_{S}$. It also applies to Datalog-rewritings, which we will make use of in Section 6.

Lemma 2.

Let $\Pi$ be a Boolean MDDLog program of diameter $k$ , $\Pi_{S}$ as in Theorem 1, and $\mathcal{Q}\in\{\text{UCQ},\text{MDLog},\text{DLog}\}$ . Then

(1) every $\mathcal{Q}$-rewriting of $\Pi_{S}$ can effectively be converted into a $\mathcal{Q}$-rewriting of $\Pi$;

(2) every $\mathcal{Q}$-rewriting of $\Pi$ can effectively be converted into a $\mathcal{Q}$-rewriting of $\Pi_{S}$ that is (i) sound on instances of girth exceeding $k$ and (ii) complete.

Proof 4.2.

Let $\mathbf{S}_{E}$ and $\mathbf{S}^{\prime}_{E}$ be the EDB schema of $\Pi$ and of $\Pi_{S}$ , respectively. We start with the case $\mathcal{Q}=\text{UCQ}$ .

For Point 1, let $q_{\Pi_{S}}$ be a UCQ-rewriting of $\Pi_{S}$ . Let $q_{\Pi}$ be the UCQ obtained from $q_{\Pi_{S}}$ by replacing every atom $R_{q(\mathbf{x})}(\mathbf{y})$ with $q[\mathbf{y}/\mathbf{x}]$ , that is, with the result of replacing the variables $\mathbf{x}$ in $q(\mathbf{x})$ with the variables $\mathbf{y}$ (which may lead to identifications). We show that $q_{\Pi}$ is as required. Let $I$ be an $\mathbf{S}_{E}$ -instance and $I^{\prime}$ the corresponding $\mathbf{S}^{\prime}_{E}$ -instance. Then we have $I\models\Pi$ iff $I^{\prime}\models\Pi_{S}$ (by Point 1 of Theorem 1) iff $I^{\prime}\models q_{\Pi_{S}}$ (by choice of $q_{\Pi_{S}}$ ) iff $I\models q_{\Pi}$ (by construction of $I^{\prime}$ and of $q_{\Pi}$ ). Let us expand on the latter.

First assume that $I^{\prime}\models q_{\Pi_{S}}$. Then there is a CQ $q$ in $q_{\Pi_{S}}$ and a homomorphism $h$ from $q$ to $I^{\prime}$. By construction, $q_{\Pi}$ contains a CQ $q^{\prime}$ that is obtained from $q$ by replacing every atom $R_{q(\mathbf{x})}(\mathbf{y})\in q$ with $q[\mathbf{y}/\mathbf{x}]$. Clearly, for every atom $R_{q(\mathbf{x})}(\mathbf{y})\in q$, we must have $R_{q(\mathbf{x})}(h(\mathbf{y}))\in I^{\prime}$. The construction of $I^{\prime}$ yields $q(h(\mathbf{y}))\subseteq I$. Consequently, $h$ is also a homomorphism from $q^{\prime}$ to $I$. Conversely, assume that there is a CQ $q^{\prime}$ in $q_{\Pi}$ and a homomorphism $h$ from $q^{\prime}$ to $I$. Then there is a CQ $q$ in $q_{\Pi_{S}}$ from which $q^{\prime}$ was obtained by the described replacement operation. For every atom $R_{q(\mathbf{x})}(\mathbf{y})\in q$, we must have $q(h(\mathbf{y}))\subseteq I$. By construction of $I^{\prime}$, we obtain $R_{q(\mathbf{x})}(h(\mathbf{y}))\in I^{\prime}$ and thus $h$ is a homomorphism from $q$ to $I^{\prime}$.
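The replacement of atoms $R_{q(\mathbf{x})}(\mathbf{y})$ by $q[\mathbf{y}/\mathbf{x}]$ used for Point 1 is a plain substitution. The sketch below illustrates it under the assumption that aggregated relations are described by a dictionary `definitions` mapping a relation name to the pair of $\mathbf{x}$ and the atoms of $q$; the names `unfold_cq` and `unfold_ucq` and the example relation are hypothetical.

```python
def unfold_cq(cq, definitions):
    """Replace every atom R_q(y) by q[y/x]; all other atoms are kept unchanged."""
    result = set()
    for name, args in cq:
        if name in definitions:
            variables, atoms = definitions[name]
            sub = dict(zip(variables, args))
            result |= {(rel, tuple(sub[v] for v in a)) for rel, a in atoms}
        else:
            result.add((name, args))
    return frozenset(result)


def unfold_ucq(ucq, definitions):
    """Apply the replacement to every CQ of a UCQ."""
    return [unfold_cq(cq, definitions) for cq in ucq]


# Hypothetical aggregated relation for q(x1,x2) = r(x1,x2), r(x2,x1).
defs = {"R_q": (("x1", "x2"), [("r", ("x1", "x2")), ("r", ("x2", "x1"))])}
# The second atom repeats a variable, which leads to an identification.
cq_over_aggregation = {("R_q", ("u", "v")), ("R_q", ("v", "v"))}
unfolded = unfold_cq(cq_over_aggregation, defs)
```

Because atoms over relations outside `definitions` are left untouched, the same substitution can be applied to the EDB part of a Datalog rule body while keeping its IDB atoms.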

For Point 2, let $q_{\Pi}$ be a UCQ-rewriting of $\Pi$ . The UCQ $q_{\Pi_{S}}$ consists of all CQs that can be obtained as follows:

(1) choose a CQ $\exists\mathbf{x}\,q(\mathbf{x})$ from $q_{\Pi}$, a contraction $\exists\mathbf{x}^{\prime}\,q^{\prime}(\mathbf{x}^{\prime})$ of $\exists\mathbf{x}\,q(\mathbf{x})$, and a partition $q_{1}(\mathbf{x}_{1}),\dots,q_{n}(\mathbf{x}_{n})$ of $q^{\prime}(\mathbf{x}^{\prime})$;

(2) for each $i\in\{1,\dots,n\}$, choose a relation $R_{p(\mathbf{z})}$ from $\mathbf{S}^{\prime}_{E}$ and a tuple $\mathbf{y}$ of $|\mathbf{z}|$ variables (repeated occurrences allowed) that are either from $\mathbf{x}_{i}$ or do not occur in $\mathbf{x}^{\prime}$ such that $q_{i}(\mathbf{x}_{i})\subseteq p[\mathbf{y}/\mathbf{z}]$; then replace $q_{i}(\mathbf{x}_{i})$ in $\exists\mathbf{x}^{\prime}\,q^{\prime}(\mathbf{x}^{\prime})$ with the single atom $R_{p(\mathbf{z})}(\mathbf{y})$.

To establish that $q_{\Pi_{S}}$ is as desired, we show that for every $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$

(I) $I^{\prime}\models q_{\Pi_{S}}$ implies $I^{\prime}\models\Pi_{S}$ if $I^{\prime}$ is of girth exceeding $k$ (soundness) and

(II) $I^{\prime}\models\Pi_{S}$ implies $I^{\prime}\models q_{\Pi_{S}}$ (completeness).

Let $I$ be the $\mathbf{S}_{E}$ -instance corresponding to $I^{\prime}$ .

For Point (I), we observe that $I^{\prime}\models q_{\Pi_{S}}$ implies $I\models q_{\Pi}$ (by construction of $q_{\Pi_{S}}$ and of $I^{\prime}$ ) implies $I\models\Pi$ (by choice of $q_{\Pi}$ ) implies $I^{\prime}\models\Pi_{S}$ (by Point 2b of Theorem 1 and if $I^{\prime}$ is of girth exceeding $k$ ). Let us zoom into the first implication. Assume that $I^{\prime}\models q_{\Pi_{S}}$ . Then there is a CQ $\exists\mathbf{u}\,p_{0}(\mathbf{u})$ in $q_{\Pi_{S}}$ and a homomorphism $h$ from $p_{0}(\mathbf{u})$ to $I^{\prime}$ . There must be some CQ $\exists\mathbf{x}\,q(\mathbf{x})$ in $q_{\Pi}$ from which $\exists\mathbf{u}\,p_{0}(\mathbf{u})$ has been constructed in Steps 1 and 2 above. Let $q_{1}(\mathbf{x}_{1}),\dots,q_{n}(\mathbf{x}_{n})$ be as in this construction. It suffices to show that $h$ is a homomorphism from $q_{i}(\mathbf{x}_{i})$ to $I$ , for each $i$ . Thus fix a $q_{i}(\mathbf{x}_{i})$ . Then there is a relation $R_{p(\mathbf{z})}\in\mathbf{S}^{\prime}_{E}$ and a tuple $\mathbf{y}$ of variables that are either from $\mathbf{x}_{i}$ or do not occur in $\mathbf{x}^{\prime}$ such that $q_{i}(\mathbf{x}_{i})\subseteq p[\mathbf{y}/\mathbf{z}]$ and $R_{p(\mathbf{z})}(\mathbf{y})\in p_{0}(\mathbf{u})$ . Thus $R_{p(\mathbf{z})}(h(\mathbf{y}))\in I^{\prime}$ . By construction of $I^{\prime}$ , this yields $q_{i}(h(\mathbf{x}_{i}))\subseteq I$ and thus we are done.

For Point (II), we have that $I^{\prime}\models\Pi_{S}$ implies $I\models\Pi$ (by Point 2a of Theorem 1) implies $I\models q_{\Pi}$ (by choice of $q_{\Pi}$ ). It thus remains to show that $I\models q_{\Pi}$ implies $I^{\prime}\models q_{\Pi_{S}}$ . Thus assume that there is a CQ $\exists\mathbf{x}\,q(\mathbf{x})$ in $q_{\Pi}$ and a homomorphism $h$ from $q(\mathbf{x})$ to $I$ . We use $\exists\mathbf{x}\,q(\mathbf{x})$ and $h$ to guide the choices in Step 1 and Step 2 of the construction of CQs in $q_{\Pi_{S}}$ to exhibit a CQ $p_{0}$ in $q_{\Pi_{S}}$ such that $p_{0}\rightarrow I^{\prime}$ .

We start with Step 1. As $\exists\mathbf{x}^{\prime}\,q^{\prime}(\mathbf{x}^{\prime})$ , we use the contraction of $\exists\mathbf{x}\,q(\mathbf{x})$ obtained by identifying variables $x$ and $y$ whenever $h(x)=h(y)$ . Thus, $h$ is an injective homomorphism from $q^{\prime}(\mathbf{x}^{\prime})$ to $I$ . We next need to choose a partition of $q^{\prime}(\mathbf{x}^{\prime})$ . For every fact $R(\mathbf{a})\in I$ , choose a fact $R_{p(\mathbf{x}_{0})}(\mathbf{b})\in I^{\prime}$ that $R(\mathbf{a})$ was obtained from during the construction of $I$ and denote this fact with $\mu(R(\mathbf{a}))$ . Now let $q_{1}(\mathbf{x}_{1}),\dots,q_{n}(\mathbf{x}_{n})$ be the partition of $q^{\prime}(\mathbf{x}^{\prime})$ obtained by grouping together two atoms $R_{1}(\mathbf{y}_{1})$ and $R_{2}(\mathbf{y}_{2})$ if and only if $\mu(R_{1}(h(\mathbf{y}_{1})))=\mu(R_{2}(h(\mathbf{y}_{2})))$ . Let $\mu(q_{i})$ denote the (unique) value of $\mu$ for all the atoms in $q_{i}(\mathbf{x}_{i})$ .

Step 2 deals with each query $q_{i}(\mathbf{x}_{i})$ separately. We choose the relation $R_{p(\mathbf{z})}$ from $\mu(q_{i})=R_{p(\mathbf{z})}(\mathbf{b})$ , which clearly is in $\mathbf{S}^{\prime}_{E}$ . We choose the tuple $\mathbf{y}$ of variables based on the tuple of individuals $\mathbf{b}$ . Let $\mathbf{b}=b_{1},\dots,b_{n}$ . Then the $\ell$ -th variable in $\mathbf{y}$ is $y$ if $h(y)=b_{\ell}$ (which is well-defined since $h$ is injective) and a fresh variable if there is no such $y$ . This finishes the guiding process and thus gives rise to a query $p_{0}(\mathbf{u})$ in $q_{\Pi_{S}}$ .

It remains to argue that $h$ can be extended to a homomorphism $h^{\prime}$ from $p_{0}(\mathbf{u})$ to $I^{\prime}$. Take a $q_{i}(\mathbf{x}_{i})$ and consider the corresponding atom $R_{p(\mathbf{z})}(\mathbf{y})$ in $p_{0}$. Then all the facts in $q_{i}(h(\mathbf{x}_{i}))\subseteq I$ were obtained from the fact $\mu(q_{i})=R_{p(\mathbf{z})}(\mathbf{b})\in I^{\prime}$ during the construction of $I$. By construction of $\mathbf{y}$ from $\mathbf{b}$, we can extend $h$ to the fresh variables in $\mathbf{y}$ so that $h(\mathbf{y})=\mathbf{b}$ and thus $R_{p(\mathbf{z})}(h(\mathbf{y}))\in I^{\prime}$. Doing this for all $q_{i}$ yields the desired $h^{\prime}$.

Now for the cases $\mathcal{Q}\in\{\text{MDLog},\text{DLog}\}$. We treat these cases together since our construction preserves the width of Datalog-rewritings. In fact, this construction is very similar to the case $\mathcal{Q}=\text{UCQ}$, so we only give a sketch.

For Point 1, let $\Gamma_{\Pi_{S}}$ be a Datalog-rewriting of $\Pi_{S}$. We construct a Datalog program $\Gamma_{\Pi}$ of the same width over EDB schema $\mathbf{S}_{E}$ by modifying the EDB part of each rule body in the same way in which we had modified the UCQ-rewriting $q_{\Pi_{S}}$ in the case $\mathcal{Q}=\text{UCQ}$: replace every EDB-atom $R_{q(\mathbf{x})}(\mathbf{y})$ with $q[\mathbf{y}/\mathbf{x}]$. We then have $I\models\Pi$ iff $I^{\prime}\models\Pi_{S}$ (by Point 1 of Theorem 1) iff $I^{\prime}\models\Gamma_{\Pi_{S}}$ (by choice of $\Gamma_{\Pi_{S}}$) iff $I\models\Gamma_{\Pi}$. The latter is by construction of $I^{\prime}$ and of $\Gamma_{\Pi}$. To prove it in more detail, it suffices to show that for every extension $J$ of $I$ to the IDB relations in $\Gamma_{\Pi_{S}}$ with corresponding extension $J^{\prime}$ of $I^{\prime}$, and every rule body $q$ in $\Gamma_{\Pi_{S}}$ which was translated into a rule body $q^{\prime}$ in $\Gamma_{\Pi}$, we have $q\rightarrow J^{\prime}$ iff $q^{\prime}\rightarrow J$. The arguments needed are as in the case $\mathcal{Q}=\text{UCQ}$.

The proof of Point 2 can be adapted from UCQs to Datalog in an analogous way. Let $\Gamma_{\Pi}$ be a Datalog-rewriting of $\Pi$. We construct a Datalog program $\Gamma_{\Pi_{S}}$ of the same width over EDB schema $\mathbf{S}^{\prime}_{E}$. The rules in $\Gamma_{\Pi_{S}}$ are obtained by taking a rule

[TABLE]

from $\Gamma_{\Pi}$ , where the $P_{i}$ are IDB and $q(\mathbf{y})$ is a CQ over schema $\mathbf{S}_{E}$ , converting $q(\mathbf{y})$ into a CQ $q^{\prime}(\mathbf{y}^{\prime})$ over schema $\mathbf{S}^{\prime}_{E}$ in two steps, in the same way in which a CQ over $\mathbf{S}_{E}$ was converted into a CQ over $\mathbf{S}^{\prime}_{E}$ in the case $\mathcal{Q}=\text{UCQ}$ , and then including in $\Gamma_{\Pi_{S}}$ the rule

[TABLE]

The crucial step in the correctness proof is to show that $I\models\Gamma_{\Pi}$ implies $I^{\prime}\models\Gamma_{\Pi_{S}}$ for any $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ and corresponding $\mathbf{S}_{E}$ -instance $I$ . The arguments are again the same as in the case $\mathcal{Q}=\text{UCQ}$ , the main difference being that we need to consider extensions of $I$ and $I^{\prime}$ to IDB relations from $\Gamma_{\Pi}$ instead of working with $I$ and $I^{\prime}$ themselves.

Point 2 of Lemma 2 only yields a rewriting of $\Pi_{S}$ on $\mathbf{S}^{\prime}_{E}$-instances of high girth. We next show that, for $\mathcal{Q}\in\{\text{UCQ},\text{MDLog}\}$, the existence of a $\mathcal{Q}$-rewriting on instances of high girth implies the existence of a $\mathcal{Q}$-rewriting that works on instances of unrestricted girth. Whether the same is true for $\mathcal{Q}=\text{DLog}$ remains an open problem. We need the following well-known lemma that goes back to Erdős and was adapted to CSPs by Feder and Vardi. Informally, it says that every instance can be 'exploded' into an instance of high girth that behaves similarly regarding homomorphisms.

Lemma 3.

For every instance $I$ and $g,s\geq 0$ , there is an instance $I^{\prime}$ (over the same schema) such that $I^{\prime}\rightarrow I$ , $I^{\prime}$ has girth exceeding $g$ , and for every instance $T$ of size at most $s$ , we have $I\rightarrow T$ iff $I^{\prime}\rightarrow T$ .

Feder and Vardi additionally show that $I^{\prime}$ can be constructed by a randomized polynomial time reduction that was later derandomized by Kun [Kun13], but here we do not rely on such computational properties. Every CQ $q$ can be viewed as an instance $I_{q}$ by using the variables as constants and the atoms as facts. It thus makes sense to speak about tree decompositions of CQs and about their treewidth, and it is clear what we mean by saying that a CQ is a tree (that is, has girth $\omega$ ).

Lemma 4.

Let $S$ be a set of templates over schema $\mathbf{S}_{E}$ , $g\geq 0$ , and $\mathcal{Q}\in\{\text{UCQ},\text{MDLog}\}$ . If coCSP( $S$ ) is $\mathcal{Q}$ -rewritable on instances of girth exceeding $g$ , then it is $\mathcal{Q}$ -rewritable.

Proof 4.3.

We start with $\mathcal{Q}=\text{UCQ}$. Let $q_{g}$ be a UCQ that defines coCSP$(S)$ on instances of girth exceeding $g$, and let $q$ be the UCQ that consists of all contractions of CQs in $q_{g}$ that are tree CQs. We show that $q$ defines coCSP$(S)$ on unrestricted $\mathbf{S}_{E}$-instances.

Let $I$ be an $\mathbf{S}_{E}$ -instance. First assume that $I\not\rightarrow S$ . By Lemma 3, there is an $\mathbf{S}_{E}$ -instance $I^{\prime}$ of girth exceeding $g$ and also exceeding the number of variables in each CQ in $q_{g}$ and satisfying $I^{\prime}\rightarrow I$ and $I^{\prime}\not\rightarrow S$ . Thus $I^{\prime}\models q_{g}$ , that is, there is a CQ $q^{\prime}$ in $q_{g}$ and a homomorphism $h$ from $q^{\prime}$ to $I^{\prime}$ . Let $q^{\prime\prime}$ be the contraction of $q^{\prime}$ obtained by identifying variables $x$ and $y$ if $h(x)=h(y)$ . Thus, $h$ is an injective homomorphism from $q^{\prime\prime}$ to $I^{\prime}$ . Since the girth of $I^{\prime}$ exceeds the number of variables in $q^{\prime\prime}$ , $q^{\prime\prime}$ must be a tree. Consequently, $q^{\prime\prime}$ is a CQ in $q$ and we have $I^{\prime}\models q$ . From $I^{\prime}\rightarrow I$ , we obtain $I\models q$ .

Now assume that $I\models q$. Then, there is a tree CQ $q^{\prime}$ in $q$ such that $q^{\prime}\rightarrow I$. When we view $q^{\prime}$ as an $\mathbf{S}_{E}$-instance $I_{q^{\prime}}$, then clearly $I_{q^{\prime}}\models q_{g}$, since the CQ in $q_{g}$ of which $q^{\prime}$ is a contraction maps homomorphically to $I_{q^{\prime}}$, and $I_{q^{\prime}}$ has girth exceeding $g$ because it is a tree. Thus, $q^{\prime}\not\rightarrow S$, and from $q^{\prime}\rightarrow I$ we obtain $I\not\rightarrow S$.
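The UCQ $q$ used in this argument can be computed explicitly from $q_{g}$ by enumerating contractions and keeping the tree-shaped ones. The following sketch does this for Boolean CQs given as sets of atoms. The function names are assumptions, equality atoms are not handled, and the tree test relies on the observation (an assumption of this sketch, though easy to verify) that an instance has no cycle exactly when its incidence multigraph between atoms and variables is a forest.

```python
def set_partitions(items):
    """Yield all partitions of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition


def contractions(cq):
    """All CQs obtained from `cq` (a set of atoms) by identifying variables."""
    variables = sorted({v for _, args in cq for v in args})
    result = set()
    for partition in set_partitions(variables):
        rep = {v: block[0] for block in partition for v in block}
        result.add(frozenset((rel, tuple(rep[v] for v in args)) for rel, args in cq))
    return result


def is_tree(cq):
    """True iff the CQ, viewed as an instance, has no cycle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for atom in cq:
        for v in atom[1]:
            a, b = find(("atom", atom)), find(("var", v))
            if a == b:
                return False  # a second connection between atom and variable closes a cycle
            parent[a] = b
    return True


def tree_contractions(ucq):
    """All contractions of CQs in the UCQ that are tree CQs."""
    return {c for cq in ucq for c in contractions(cq) if is_tree(c)}


two_step_path = frozenset({("r", ("x", "y")), ("r", ("y", "z"))})
```

For the path query $r(x,y)\wedge r(y,z)$, there are five contractions, and only the query itself is a tree; every proper contraction introduces a cycle of length one or two.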

Now for the case $\mathcal{Q}=\text{MDLog}$ . Let $\Gamma_{g}$ be an MDLog program that defines coCSP $(S)$ on instances of girth exceeding $g$ . Let $\Gamma$ be the program obtained from $\Gamma_{g}$ by replacing every rule $P(x)\leftarrow q(\mathbf{x})$ with all rules $P(x)\leftarrow q^{\prime}(\mathbf{x}^{\prime})$ such that $q^{\prime}(\mathbf{x}^{\prime})$ is a tree CQ that is a contraction of $q(\mathbf{x})$ . We show that $\Gamma$ is an MDLog-definition of coCSP $(S)$ on instances of unrestricted girth.

Let $I$ be an $\mathbf{S}_{E}$ -instance. First assume that $I\not\rightarrow S$ . By Lemma 3, there is an $\mathbf{S}_{E}$ -instance $I^{\prime}$ whose girth exceeds $g$ and also exceeds the diameter of $\Gamma_{g}$ and that satisfies $I^{\prime}\rightarrow I$ and $I^{\prime}\not\rightarrow S$ . The latter yields $I^{\prime}\models\Gamma_{g}$ . It remains to show that this implies $I^{\prime}\models\Gamma$ since with $I^{\prime}\rightarrow I$ , this yields $I\models\Gamma$ as required.

To show that $I^{\prime}\models\Gamma$ follows from $I^{\prime}\models\Gamma_{g}$ , it suffices to show that all IDB facts derived by $\Gamma_{g}$ starting from $I^{\prime}$ are also derived by $\Gamma$ . Thus let $J^{\prime}$ be an extension of $I^{\prime}$ to the IDBs in $\Gamma_{g}$ . It is enough to show that when a single application of a rule from $\Gamma_{g}$ in $J^{\prime}$ yields an IDB atom $P(a)$ , then $\Gamma$ can derive the same atom. The former is the case only if $\Gamma_{g}$ contains a rule $P(x)\leftarrow q(\mathbf{x})$ such that there is a homomorphism $h$ from $q(\mathbf{x})$ to $J^{\prime}$ with $h(x)=a$ . Let $q^{\prime}(\mathbf{x}^{\prime})$ be the contraction of $q(\mathbf{x})$ obtained by identifying variables $x$ and $y$ when $h(x)=h(y)$ . Since the girth of $I^{\prime}$ exceeds the diameter of $\Gamma_{g}$ , $q^{\prime}(\mathbf{x}^{\prime})$ is a tree. Thus, $\Gamma$ contains the rule $P(x)\leftarrow q^{\prime}(\mathbf{x}^{\prime})$ and the application of this rule in $J^{\prime}$ enabled by $h$ yields $P(a)$ . We have thus shown $I^{\prime}\models\Gamma$ and are done.

Now assume that $I\models\Gamma$ . Then there is a proof tree for ${\mathtt{goal}}()$ from $I$ and $\Gamma$ , see [AHV95] for details. From that tree, we can read off an $\mathbf{S}_{E}$ -instance $O$ such that $O\rightarrow I$ , $O\models\Gamma$ , and, since $\Gamma$ is monadic and only comprises rules with tree-shaped bodies, $O$ is a tree. Thus, $O$ has girth exceeding $g$ . Since every rule body in $\Gamma$ is a contraction of a rule body in $\Gamma_{g}$ , $O\models\Gamma$ implies $O\models\Gamma_{g}$ , and we get $O\not\rightarrow S$ . But with $O\rightarrow I$ , this yields $I\not\rightarrow S$ as required.

Putting together Theorems 1 and 1, Proposition 1, and Lemmas 2 and 4, we obtain the following reductions of rewritability of Boolean MDDLog programs to CSP rewritability.

Proposition 5.

Every Boolean MDDLog program $\Pi$ can be converted into a set of templates $S_{\Pi}$ such that

(1) $\Pi$ is $\mathcal{Q}$ -rewritable iff coCSP $(S_{\Pi})$ is $\mathcal{Q}$ -rewritable for every $\mathcal{Q}\in\{\text{FO},\text{UCQ},\text{MDLog}\}$ ;

(2) every $\mathcal{Q}$ -rewriting of $\Pi$ can be effectively translated into a $\mathcal{Q}$ -rewriting of coCSP $(S_{\Pi})$ and vice versa, for every $\mathcal{Q}\in\{\text{UCQ},\text{MDLog}\}$ ;

(3) $|S_{\Pi}|\leq 2^{2^{p(n)}}$ and $|T|\leq 2^{2^{p(n)}}$ for each $T\in S_{\Pi}$ , where $n$ is the size of $\Pi$ and $p$ is a polynomial.

The construction takes time polynomial in $\sum_{T\in S_{\Pi}}|T|$ .

FO-rewritability of CSPs (and their complements) is NP-complete [LLT07] and it was observed in [BtCLW14] that the upper bound lifts to generalized CSPs. MDLog-rewritability of coCSPs is NP-hard and in ExpTime [CL17]. We show in Appendix B that also this upper bound lifts to generalized coCSPs. Together with Proposition 5, this yields the upper bounds in the following theorem. The lower bounds are from [BL16].

Theorem 6.

For Boolean MDDLog programs and the complement of MMSNP sentences,

(1) FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime-complete;

(2) MDLog-rewritability is in 3ExpTime (and 2NExpTime-hard).

5. Shape of Rewritings, Obstructions, Explosion

In the FO case, it is possible to extract from the approach in the previous section an algorithm that computes actual rewritings, if they exist. However, that algorithm is hardly practical. An important first step towards the design of more practical algorithms that compute rewritings (in an exact or in an approximative way) is to analyze the shape of rewritings. In fact, both FO- and MDLog-rewritings of coCSPs are known to be of a rather restricted shape, far from exploiting the full expressive power of the target languages. In this section, we establish corresponding results for Boolean MDDLog. This topic is closely related to the theory of obstructions, so we also establish connections between the rewritability of MMSNP sentences and natural obstruction sets. Finally, we observe an MMSNP counterpart of Lemma 3, the fundamental ‘explosion’ lemma for CSPs.

The following summarizes our results regarding the shape of rewritings.

Theorem 7.

Let $\Pi$ be a Boolean MDDLog program of diameter $k$ . Then

(1) if $\Pi$ is FO-rewritable, then it has a UCQ-rewriting in which each CQ has treewidth $(1,k)$ ;

(2) if $\Pi$ is MDLog-rewritable, then it has an MDLog-rewriting of diameter $k$ .

Proof 5.1.

We analyze the proof of Lemma 2 and use known results from CSP. In fact, any FO-rewritable coCSP has a UCQ-rewriting that consists of tree CQs [NT00], and thus the same holds for simple Boolean MDDLog programs. If we convert such a rewriting of $\Pi_{S}$ into a rewriting of $\Pi$ as in the proof of Lemma 2, we obtain a UCQ-rewriting in which each CQ has treewidth $(1,k)$ . For Point 2 of Theorem 7, one uses the proof of Lemma 2 and the known fact that every MDLog-rewritable CSP has an MDLog-rewriting in which every rule body comprises at most one EDB atom, see e.g. the proof of Theorem 19 in [FV98].

In a sense, the concrete bound $k$ in Points 1 and 2 of Theorem 7 is quite remarkable. Point 2 says, for example, that when eliminating disjunctions from a Boolean MDDLog program, it is never necessary to increase the diameter!

We now consider obstructions. An obstruction set $\mathcal{O}$ for a CSP template $T$ over schema $\mathbf{S}_{E}$ is a set of instances over the same schema such that for any $\mathbf{S}_{E}$ -instance $I$ , we have $I\not\rightarrow T$ iff $O\rightarrow I$ for some $O\in\mathcal{O}$ . The elements of $\mathcal{O}$ are called obstructions. A lot is known about CSP obstructions. For example, $T$ is FO-rewritable if and only if it has a finite obstruction set [Ats08] if and only if it has a finite obstruction set that consists of finite trees [NT00], and $T$ is MDLog-rewritable if and only if it has a (potentially infinite) obstruction set that consists of finite trees [FV98]. Here we consider obstruction sets for MMSNP, defined in the obvious way: an obstruction set $\mathcal{O}$ for an MMSNP sentence $\theta$ over schema $\mathbf{S}_{E}$ is a set of instances over the same schema such that for any $\mathbf{S}_{E}$ -instance $I$ , we have $I\not\models\theta$ iff $O\rightarrow I$ for some $O\in\mathcal{O}$ . This should not be confused with colored forbidden patterns used to characterize MMSNP in [MS07]. The following characterizes FO-rewritability of MMSNP sentences in terms of obstruction sets.

Corollary 8.

For every MMSNP sentence $\theta$ , the following are equivalent:

(1) $\theta$ is FO-rewritable;

(2) $\theta$ has a finite obstruction set;

(3) $\theta$ has a finite set of finite obstructions of treewidth $(1,k)$ .
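To make the notion of a finite obstruction set concrete, here is a small Python sketch for the CSP case. It uses a toy template chosen for illustration (a single directed edge, not taken from the paper), for which the directed path with two edges serves as the only obstruction. The brute-force helper `homomorphism_exists` and the names `TEMPLATE` and `PATH2` are assumptions of this sketch.

```python
from itertools import product


def homomorphism_exists(src, dst):
    """Brute-force test for a homomorphism between two sets of facts.

    Facts are pairs (relation name, tuple of constants).
    """
    src_dom = sorted({a for _, args in src for a in args})
    dst_dom = sorted({a for _, args in dst for a in args})
    for image in product(dst_dom, repeat=len(src_dom)):
        h = dict(zip(src_dom, image))
        if all((rel, tuple(h[a] for a in args)) in dst for rel, args in src):
            return True
    return False


# Toy template T: one directed edge 0 -> 1.
TEMPLATE = {("E", (0, 1))}
# Candidate obstruction: the directed path x -> y -> z (a finite tree).
PATH2 = {("E", ("x", "y")), ("E", ("y", "z"))}


def violates_template(instance):
    """I does not map to T, decided directly."""
    return not homomorphism_exists(instance, TEMPLATE)


def hit_by_obstruction(instance):
    """Some obstruction maps to I (here the set contains only PATH2)."""
    return homomorphism_exists(PATH2, instance)


example = {("E", (0, 1)), ("E", (1, 2))}
```

On every instance, `violates_template` and `hit_by_obstruction` should agree, which is the defining property $I\not\rightarrow T$ iff $O\rightarrow I$ for some $O\in\mathcal{O}$ .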

Corollary 8 follows from Point 1 of Theorem 7 and the straightforward observations that an MMSNP sentence $\theta$ is FO-rewritable iff $\neg\theta$ is (which is equivalent to a Boolean MDDLog program) and that every finite obstruction set $\mathcal{O}$ for $\theta$ gives rise to a UCQ-rewriting $\bigvee\mathcal{O}$ of $\neg\theta$ and vice versa. We now turn to MDLog-rewritability.

Proposition 9.

Let $\theta$ be an MMSNP sentence of diameter $k$ . Then $\neg\theta$ is MDLog-rewritable iff $\theta$ has a set of obstructions (equivalently: finite obstructions) that are of treewidth $(1,k)$ .

Proof 5.2.

The “only if” direction is a consequence of Point 2 of Theorem 7 and the fact that, for any Boolean monadic Datalog program $\Pi\equiv\neg\theta$ of diameter $k$ over EDB schema $\mathbf{S}_{E}$ , a proof tree for ${\mathtt{goal}}()$ from an $\mathbf{S}_{E}$ -instance $I$ and $\Pi$ gives rise to a finite $\mathbf{S}_{E}$ -instance $J$ of treewidth $(1,k)$ with $J\rightarrow I$ . The desired obstruction set for $\theta$ is then the set of all these $J$ . The “if” direction is a consequence of Theorem 5 in [BD13].

We remark that the results in [BD13] almost give Proposition 9, but do not seem to deliver any concrete bound on the parameter $k$ of the treewidth of obstruction sets.

We close by observing an MMSNP counterpart of the ‘explosion’ Lemma 3, first giving a preliminary. Let $I$ be an instance over some schema $\mathbf{S}_{E}$ . A $(1,k)$ -decomposition of $I$ is a pair $(V,(I_{v})_{v\in V})$ where $V$ is a set of indices and $(I_{v})_{v\in V}$ is a partition of $I$ such that for all distinct $v,v^{\prime}\in V$ , $|\mathsf{dom}(I_{v})\cap\mathsf{dom}(I_{v^{\prime}})|\leq 1$ and $|\mathsf{dom}(I_{v})|\leq k$ . Thus, a $(1,k)$ -decomposition $D=(V,(I_{v})_{v\in V})$ decomposes $I$ into parts of size at most $k$ and with little overlap. These parts can be viewed as the facts of an instance $I_{D}$ over an aggregation schema $\mathbf{S}^{\prime}_{E}$ defined by the relations $R_{q_{v}(\mathbf{x})}$ where $q_{v}(\mathbf{x})$ is $I_{v}$ viewed as a CQ, that is,

[TABLE]

where we assume some fixed (but otherwise irrelevant) order on the elements of each $\mathsf{dom}(I_{v})$ . Now, we say that $I$ has $(1,k)$ -decomposition girth $g$ if $g$ is the supremum of the girths of $I_{D}$ , for all $(1,k)$ -decompositions $D$ of $I$ . It can be shown that $I$ has $(1,k)$ -decomposition girth $\omega$ if and only if it has treewidth $(1,k)$ .
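The conditions on a $(1,k)$ -decomposition can be checked mechanically. The following Python sketch does so under the assumption that an instance is a set of facts `(relation, constants)` and that a decomposition is given as a list of sets of facts; the function names are hypothetical. It does not compute the decomposition girth.

```python
def dom(facts):
    """Set of constants occurring in a set of facts."""
    return {a for _, args in facts for a in args}


def is_1k_decomposition(instance, parts, k):
    """Check that parts is a (1,k)-decomposition of instance.

    - the parts partition the facts of the instance,
    - every part mentions at most k constants,
    - two distinct parts share at most one constant.
    """
    instance = set(instance)
    parts = [set(p) for p in parts]
    union = set().union(*parts) if parts else set()
    if union != instance or sum(len(p) for p in parts) != len(instance):
        return False  # not a partition (facts missing, extra, or repeated)
    if any(len(dom(p)) > k for p in parts):
        return False
    return all(
        len(dom(p) & dom(q)) <= 1
        for i, p in enumerate(parts)
        for q in parts[i + 1:]
    )


# Hypothetical examples: a directed path and a directed triangle.
path = {("E", ("a", "b")), ("E", ("b", "c")), ("E", ("c", "d"))}
triangle = {("E", ("a", "b")), ("E", ("b", "c")), ("E", ("c", "a"))}
```

For instance, splitting `path` into its three single facts gives a $(1,2)$ -decomposition, whereas grouping two edges of `triangle` into one part leaves the remaining edge sharing two constants with that part.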

Here comes the announced MMSNP counterpart of Lemma 3.

Lemma 10.

For every instance $I$ and $g\geq s>0$ , and every MDDLog program $\Pi$ of diameter at most $s$ , there is an instance $J$ (over the same schema) such that $J\rightarrow I$ , $J$ has $(1,s)$ -decomposition girth exceeding $g$ , and $I\models\Pi$ iff $J\models\Pi$ .

Proof 5.3.

Let $\Pi$ be a Boolean MDDLog program of diameter $k\leq s$ over EDB schema $\mathbf{S}_{E}$ . By Theorems 1 and 1, there is a $k$ -aggregation schema $\mathbf{S}^{\prime}_{E}$ and a set of templates $S_{\Pi}$ over $\mathbf{S}^{\prime}_{E}$ such that:

(1) for any $\mathbf{S}_{E}$ -instance $I$ with corresponding $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ , $I\models\Pi$ iff $I^{\prime}\not\rightarrow S_{\Pi}$ ;

(2) for any $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ whose girth exceeds $k$ with corresponding $\mathbf{S}_{E}$ -instance $I$ , $I^{\prime}\not\rightarrow S_{\Pi}$ iff $I\models\Pi$ .

Let $I$ and $I^{\prime}$ be an $\mathbf{S}_{E}$ -instance and its corresponding $\mathbf{S}^{\prime}_{E}$ -instance. Furthermore, let $J^{\prime}$ be the $\mathbf{S}^{\prime}_{E}$ -instance obtained from $I^{\prime}$ by applying Lemma 3 with $s=\max\{|T|\mid T\in S_{\Pi}\}$ and $g$ as given. Then $J^{\prime}\rightarrow I^{\prime}$ , $J^{\prime}$ has girth exceeding $g$ , and $J^{\prime}\rightarrow S_{\Pi}$ iff $I^{\prime}\rightarrow S_{\Pi}$ iff $I\not\models\Pi$ . Let $J$ be the $\mathbf{S}_{E}$ -instance corresponding to $J^{\prime}$ . As $J^{\prime}$ has girth exceeding $k$ , Point (2) above yields $J\models\Pi$ iff $J^{\prime}\not\rightarrow S_{\Pi}$ . In summary, we thus obtain $I\models\Pi$ iff $J\models\Pi$ .

It thus remains to show that $J$ has $(1,s)$ -decomposition girth exceeding $g$ and that $J\rightarrow I$ . The former is witnessed by the $(1,k)$ -decomposition $D=(V,(I_{v})_{v\in V})$ of $J$ obtained by using as $V$ the facts of $J^{\prime}$ and as $I_{v}$ the set of facts obtained from fact $v$ during the construction of $J$ .

As the last step, we argue that $J\rightarrow I$ follows from $J^{\prime}\rightarrow I^{\prime}$ , and that in fact any homomorphism $h$ from $J^{\prime}$ to $I^{\prime}$ is also a homomorphism from $J$ to $I$ . Thus let $h$ be such a homomorphism. For any fact $R(a_{i_{1}},\dots,a_{i_{\ell}})$ in $J$ , there is a fact $R_{q(x_{1},\dots,x_{n})}(a_{1},\dots,a_{n})$ in $J^{\prime}$ such that $R(x_{i_{1}},\dots,x_{i_{\ell}})\in q(x_{1},\dots,x_{n})$ . We have $R_{q(x_{1},\dots,x_{n})}(h(a_{1}),\dots,h(a_{n}))\in I^{\prime}$ . By definition of $I^{\prime}$ , this means $R(h(a_{i_{1}}),\dots,h(a_{i_{\ell}}))\in I$ and we are done.

We believe that Lemma 10 can be useful in many contexts, saving a detour via CSPs. For example, it enables an alternative proof of Theorem 7. We illustrate this for Point 1. We can start with a UCQ-rewriting $q$ of an MDDLog program $\Pi$ of diameter $k$ and show that the UCQ $q_{t}$ that consists of all CQs of treewidth $(1,k)$ that are a contraction of a CQ in $q$ must also be a rewriting of $\Pi$ : take an instance $I$ that makes $\Pi$ true, use Lemma 10 to transform $I$ to an $I^{\prime}$ of girth exceeding $k$ and also exceeding the size of any CQ in $q$ such that $I^{\prime}\models\Pi$ and $I^{\prime}\rightarrow I$ , observe that $I^{\prime}\models q$ and that a homomorphism from any CQ $p$ in $q$ to $I^{\prime}$ gives rise to a homomorphism from a CQ $p^{\prime}$ in $q_{t}$ to $I^{\prime}$ , and derive $p^{\prime}\rightarrow I$ from $I^{\prime}\rightarrow I$ .

6. Datalog-Rewritability of Boolean MDDLog Programs and Canonical Datalog Programs

We consider rewriting Boolean MDDLog programs into Datalog programs, making two contributions. First, we show that Datalog-rewritability is decidable for programs that have equality, a condition that is defined in detail below. For programs that do not have equality, the same construction yields a procedure that is sound, but whose completeness remains an open problem. And second, we give a new and direct construction of canonical Datalog-rewritings of Boolean MDDLog programs (equivalently: the complements of MMSNP sentences), bypassing the construction of infinite templates [BD13] which involves the application of a non-trivial construction due to Cherlin, Shelah, and Shi [CSS99]. This construction is potentially useful even though it is yet unknown whether Datalog-rewritability of MDDLog programs $\Pi$ is decidable (for programs that do not have equality): when $\Pi$ is not rewritable, then the canonical Datalog-rewriting is the best possible approximation of $\Pi$ in terms of a Datalog program (of given width and diameter).

6.1. Datalog-Rewritability of Boolean MDDLog Programs

A CSP template $T$ has equality if its EDB schema includes the distinguished binary relation ${\mathtt{eq}}$ and $T$ interprets ${\mathtt{eq}}$ as the relation $\{(a,a)\mid a\in\mathsf{dom}(T)\}$ . Thus, ${\mathtt{eq}}$ is an extremely natural kind of constraint: a fact ${\mathtt{eq}}(a,b)$ in the input instance means that $a$ and $b$ must be mapped to the same template element; spoken from the perspective of constraint satisfaction, they are variables that must receive the same value.

In accordance with the above, we say that an MDDLog program $\Pi$ has equality if its EDB schema includes the distinguished binary relation ${\mathtt{eq}}$ , $\Pi$ contains the rules

[TABLE]

for each IDB relation $P$ , and these are the only rules that mention ${\mathtt{eq}}$ . Thus, a fact ${\mathtt{eq}}(a,b)$ in the input instance says that the same IDB relations can be derived by $\Pi$ for $a$ and for $b$ . It can be verified that when an MDDLog program that has equality is converted into a generalized CSP based on a set of templates $S_{\Pi}$ according to Theorems 1 and 1 (using the concrete constructions in the appendix), then all templates in $S_{\Pi}$ have equality.

We aim to show decidability of the Datalog-rewritability of MDDLog programs that have equality following the strategy that we have used for rewritability into FO and into MDLog in Section 4. We thus need a counterpart of Lemma 4, that is, we have to show that for all templates $T$ that have equality, Datalog-rewritability of coCSP $(T)$ on instances of high girth implies unrestricted Datalog-rewritability. It is here that having equality is an advantage. In particular, every input instance for coCSP $(T)$ can be transformed into an instance of high girth, while preserving the existence (or non-existence) of homomorphisms to $T$ , by introducing additional ${\mathtt{eq}}$ -facts. This is similar in spirit to the explosion Lemma 3, but the construction is much simpler than in the proof of that lemma. We next make it explicit.

Let $I$ be an $\mathbf{S}_{E}$ -instance and let $g\geq 0$ . We use ${\mathtt{pos}}(I)$ to denote the set of pairs $(R(\mathbf{a}),i)$ such that $R(\mathbf{a})\in I$ and $i\in{\mathtt{pos}}(R)$ . In what follows, for any tuple of constants $\mathbf{a}$ , we use $a_{i}$ to denote its $i$ -th component. Reserve fresh constants as follows:

•

a constant $b_{p}$ , for all $p=(R(\mathbf{a}),i)\in{\mathtt{pos}}(I)$ ;

•

$g$ constants $b_{p,p^{\prime},1},\dots,b_{p,p^{\prime},g}$ , for all $p,p^{\prime}=(R(\mathbf{a}),i),(R^{\prime}(\mathbf{a}^{\prime}),i^{\prime})\in{\mathtt{pos}}(I)$ with $a_{i}=a^{\prime}_{i^{\prime}}$ .

Define an instance $I^{g}$ that consists of the following facts:

(1) for every $R(\mathbf{a})\in I$ with $R$ of arity $n$ , the fact $R(b_{p_{1}},\dots,b_{p_{n}})$ where $p_{i}=(R(\mathbf{a}),i)$ ;

(2) for all distinct $p,p^{\prime}=(R(\mathbf{a}),i),(R^{\prime}(\mathbf{a}^{\prime}),i^{\prime})\in{\mathtt{pos}}(I)$ with $a_{i}=a^{\prime}_{i^{\prime}}$ , the facts

[TABLE]

Observe that $I^{g}$ has girth exceeding $g$ . Moreover, it satisfies the following crucial property.

Lemma 11.

For every CSP template $T$ over $\mathbf{S}_{E}$ that has equality, $I^{g}\rightarrow T$ iff $I\rightarrow T$ .

Proof 6.1.

Let $T$ be a template over $\mathbf{S}_{E}$ that has equality. We have to show that there is a homomorphism $h$ from $I$ to $T$ iff there is a homomorphism $h_{g}$ from $I^{g}$ to $T$ . In fact, $h_{g}$ can be obtained from $h$ by setting $h_{g}(b_{p})=h_{g}(b_{p,p^{\prime},j})=h(a_{i})$ when $p=(R(\mathbf{a}),i)$ ; conversely, $h$ can be obtained from $h_{g}$ by setting $h(a_{i})=h_{g}(b_{p})$ when $p=(R(\mathbf{a}),i)$ . The latter is well-defined by construction of $I^{g}$ and since ${\mathtt{eq}}$ is interpreted as the identity relation in $T$ .
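Lemma 11 can be sanity-checked on small inputs. The Python sketch below builds $I^{g}$ and compares homomorphism existence by brute force. It rests on an assumption: the ${\mathtt{eq}}$ -facts of Point 2 are taken to form a chain $b_{p},b_{p,p^{\prime},1},\dots,b_{p,p^{\prime},g},b_{p^{\prime}}$ for every ordered pair of distinct positions holding the same constant, which is the reading suggested by Conditions 2a to 2c later in this section. Positions are 0-based in the code, and all names are hypothetical.

```python
from itertools import product


def homomorphism_exists(src, dst):
    """Brute-force homomorphism test; facts are (relation, tuple of constants)."""
    src_dom = list({a for _, args in src for a in args})
    dst_dom = list({a for _, args in dst for a in args})
    for image in product(dst_dom, repeat=len(src_dom)):
        h = dict(zip(src_dom, image))
        if all((rel, tuple(h[a] for a in args)) in dst for rel, args in src):
            return True
    return False


def build_Ig(I, g):
    """Construct I^g (assumed chain reading of the eq-facts, see lead-in)."""
    positions = [(fact, i) for fact in I for i in range(len(fact[1]))]
    Ig = set()
    # Point 1: a private copy of every fact over fresh constants b_p.
    for rel, args in I:
        Ig.add((rel, tuple(("b", ((rel, args), i)) for i in range(len(args)))))
    # Point 2: connect positions that held the same constant by an eq-chain.
    for p in positions:
        for q in positions:
            if p != q and p[0][1][p[1]] == q[0][1][q[1]]:
                chain = [("b", p)] + [("b", p, q, j) for j in range(1, g + 1)] + [("b", q)]
                for u, v in zip(chain, chain[1:]):
                    Ig.add(("eq", (u, v)))
    return Ig


# Toy template with equality: one edge 0 -> 1, eq interpreted as identity.
T = {("E", (0, 1)), ("eq", (0, 0)), ("eq", (1, 1))}
I_no = {("E", ("a", "b")), ("E", ("b", "c"))}   # does not map to T
I_yes = {("E", ("a", "b")), ("E", ("c", "b"))}  # maps to T
```

The brute-force search is exponential in the number of constants, so this is only meant for instances with a handful of facts.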

We are now ready to establish the announced counterpart of Lemma 4.

Lemma 12.

Let $S$ be a set of templates over schema $\mathbf{S}_{E}$ that have equality, and let $g\geq 0$ . If coCSP( $S$ ) is DLog-rewritable on instances of girth exceeding $g$ , then it is DLog-rewritable.

Proof 6.2.

Assume that coCSP $(S)$ is DLog-rewritable on instances of girth exceeding $g$ and let $\Gamma$ be a concrete rewriting. We construct a Datalog program $\Gamma^{\prime}$ such that for any $\mathbf{S}_{E}$ -instance $I$ , $I\models\Gamma^{\prime}$ iff $I^{g}\models\Gamma$ . Clearly, $\Gamma^{\prime}$ is then a rewriting of coCSP $(S)$ on instances of unrestricted girth.

We aim to construct $\Gamma^{\prime}$ such that it mimics the execution of $\Gamma$ on $I^{g}$ , despite being executed on $I$ . One challenge is that the domains of $I$ and $I^{g}$ are not identical. In $\Gamma^{\prime}$ , the IDB relations of $\Gamma$ need to be adapted to reflect this change of domain, and so do the rules. Let $m$ be the maximum arity of any relation in $\mathbf{S}_{E}$ . Every IDB relation $P$ of $\Gamma$ gives rise to a set of IDB relations in $\Gamma^{\prime}$ . In fact, every position of $P$ can be replaced either with

(1) $\ell$ positions, for some $\ell\leq m$ , reflecting the case that the position is filled with a constant $b_{p}$ where $p=(R(\mathbf{a}),i)$ with $R$ $\ell$ -ary; or with

(2) $\ell+\ell^{\prime}$ positions, for some $\ell,\ell^{\prime}\leq m$ , reflecting the case that the position is filled with a constant $b_{p,p^{\prime},j}$ where $p=(R(\mathbf{a}),i)$ and $p^{\prime}=(R^{\prime}(\mathbf{a}^{\prime}),i^{\prime})$ , with $R$ $\ell$ -ary and $R^{\prime}$ $\ell^{\prime}$ -ary.

In Case 1, the $\ell$ positions store the constants in $\mathbf{a}$ . The symbol $R$ and the number $i$ from $p$ also need to be stored, which is done as an annotation to the IDB relation. In Case 2, the first $\ell$ positions store the constants in $\mathbf{a}$ while the latter $\ell^{\prime}$ positions store the constants in $\mathbf{a}^{\prime}$ ; we additionally need to store the symbols $R$ and $R^{\prime}$ , the numbers $i$ and $i^{\prime}$ from $p$ and $p^{\prime}$ , and the number $j$ , which is again done by annotation of the IDB relation.

Let us make this formal. The IDB relations of $\Gamma^{\prime}$ take the form $P^{\mu}$ where $P$ is an IDB relation of $\Gamma$ and $\mu$ is a function from ${\mathtt{pos}}(P)$ to

[TABLE]

such that if $\mu(\ell)=(R,i)$ , then $i\in{\mathtt{pos}}(R)$ and if $\mu(\ell)=(R,i,R^{\prime},i^{\prime},j)$ , then $i\in{\mathtt{pos}}(R)$ and $i^{\prime}\in{\mathtt{pos}}(R^{\prime})$ . The arity of $P^{\mu}$ is $\sum_{\ell=1..{\mathtt{pos}}(P)}q_{\ell}$ where $q_{\ell}$ is the arity of $R$ if $\mu(\ell)=(R,i)$ and $q_{\ell}$ is the sum of the arities of $R$ and $R^{\prime}$ if $\mu(\ell)=(R,i,R^{\prime},i^{\prime},j)$ . In the construction of $\Gamma^{\prime}$ , we manipulate the rules of $\Gamma$ to account for this change in the IDB schema. We can assume w.l.o.g. that $\Gamma$ is closed under contractions of rules. Let

[TABLE]

be a rule in $\Gamma$ where $P_{0},\dots,P_{\ell_{1}}$ are IDB and $R_{1},\dots,R_{\ell_{2}}$ are EDB (possibly the distinguished ${\mathtt{eq}}$ relation), such that

( $*$ )

every variable occurs at most once in $R_{1}(\mathbf{y}_{1})\wedge\cdots\wedge R_{\ell_{2}}(\mathbf{y}_{\ell_{2}})$ .

Note that it might be possible to write a single rule from $\Gamma$ in the above form in more than one way because ${\mathtt{eq}}$ -atoms can be placed in the second line or in the third line; we then consider all possible ways. Informally, this choice corresponds to the decision whether the ${\mathtt{eq}}$ -atom is mapped to an ${\mathtt{eq}}$ -fact in $I^{g}$ that comes from an ${\mathtt{eq}}$ -fact in $I$ (Point 1 of the definition of $I^{g}$ ) or to a fresh ${\mathtt{eq}}$ -fact (Point 2 of the definition of $I^{g}$ ). Also note that rules that do not satisfy ( $*$ ) can be ignored since they never apply in $I^{g}$ .

Let $\mathbf{x}$ be the variables in the rule, and let $\delta:\mathbf{x}\rightarrow\Omega$ be such that the following conditions are satisfied:

(1) for each $R_{i}(\mathbf{y}_{i})$ with $\mathbf{y}_{i}=y_{1}\cdots y_{k}$ , we have $\delta(y_{j})=(R_{i},j)$ for all $j$ ;

(2) for each ${\mathtt{eq}}(z_{i,1},z_{i,2})$ , one of the following is true for some $R,i,R^{\prime},i^{\prime},j$ :

(a) $\delta(z_{i,1})=(R,i)$ and $\delta(z_{i,2})=(R,i,R^{\prime},i^{\prime},1)$ ;

(b) $\delta(z_{i,1})=(R,i,R^{\prime},i^{\prime},g)$ and $\delta(z_{i,2})=(R^{\prime},i^{\prime})$ ;

(c) $\delta(z_{i,1})=(R,i,R^{\prime},i^{\prime},j)$ and $\delta(z_{i,2})=(R,i,R^{\prime},i^{\prime},j\pm 1)$ .

With each variable $x$ in $\mathbf{x}$ , we associate a tuple $\mathbf{u}_{x}$ of distinct variables. If $\delta(x)$ is of the form $(R,i)$ , then the length of $\mathbf{u}_{x}$ matches the arity of $R$ and $\mathbf{u}_{x}$ is called a variable block. If $\delta(x)$ is of the form $(R,i,R^{\prime},i^{\prime},j)$ , then the length of $\mathbf{u}_{x}$ is the sum of the arities $n$ and $n^{\prime}$ of $R$ and $R^{\prime}$ ; the first $n$ variables in $\mathbf{u}_{x}$ are then also called a variable block, and so are the last $n^{\prime}$ variables. Variable blocks will either be disjoint or identical. Identities are minimized such that the following conditions are satisfied:

(I1) if $x$ occurs in some $\mathbf{y}_{i}$ , then $\mathbf{u}_{x}=\mathbf{y}_{i}$ ;

(I2) if Case 2a applies to ${\mathtt{eq}}(z_{i,1},z_{i,2})$ , then $\mathbf{u}_{z_{i,1}}$ is identical to the first variable block in $\mathbf{u}_{z_{i,2}}$ ;

(I3) if Case 2b applies to ${\mathtt{eq}}(z_{i,1},z_{i,2})$ , then $\mathbf{u}_{z_{i,2}}$ is identical to the second variable block in $\mathbf{u}_{z_{i,1}}$ ;

(I4) if Case 2c applies to ${\mathtt{eq}}(z_{i,1},z_{i,2})$ , then the first variable blocks of $\mathbf{u}_{z_{i,1}}$ and $\mathbf{u}_{z_{i,2}}$ are identical, and so are the second variable blocks.

Regarding (I1), note that $x$ cannot occur in more than one $\mathbf{y}_{i}$ because of ( $*$ ), thus the condition can always be satisfied ‘without conflicts’. Then include in $\Gamma^{\prime}$ the rule

[TABLE]

such that

(R1) if the $k$ -th component in $\mathbf{x}_{i}$ is $x$ , then $\mu_{i}(k)=\delta(x)$ ;

(R2) $\mathbf{x}^{\prime}_{i}$ is obtained from $\mathbf{x}_{i}$ by replacing each variable $x$ with $\mathbf{u}_{x}$ ;

(R3) $W$ contains the following atoms:

•

for each variable $x\in\mathbf{x}$ with $\delta(x)$ of the form $(R,i)$ , an atom $R(\mathbf{w})$ where the $i$ -th component of $\mathbf{w}$ is $x$ and all other variables are distinct and fresh;

•

for each variable $x\in\mathbf{x}$ with $\delta(x)$ of the form $(R,i,R^{\prime},i^{\prime},j)$ , atoms $R(\mathbf{w}),R^{\prime}(\mathbf{w}^{\prime})$ where the $i$ -th component of $\mathbf{w}$ and the $i^{\prime}$ -th component of $\mathbf{w}^{\prime}$ is $x$ and all other variables are distinct and fresh.

As an example, consider the following rule in $\Gamma$ :

[TABLE]

where $R$ is EDB and $P$ IDB, and let $\delta(x_{1})=(R,1)$ , $\delta(x_{2})=(R,2)$ , and $\delta(x_{3})=(R,2,R,1,1)$ . Note that Case 2a applies to ${\mathtt{eq}}(x_{2},x_{3})$ . We have $\mathbf{u}_{x_{1}}=\mathbf{u}_{x_{2}}=x_{1}x_{2}$ and $\mathbf{u}_{x_{3}}=x_{1}x_{2}u_{1}u_{2}$ and thus obtain the following rule in $\Gamma^{\prime}$ :

[TABLE]

where the last line corresponds to $W$ above, and where $\mu(1)=(R,1)$ and $\mu(2)=(R,2,R,1,1)$ .
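The arity bookkeeping for the annotated relations $P^{\mu}$ can be sketched in a few lines of Python. The encoding of $\mu$ as a list of tuples (one per position of $P$ ) and the dictionary of EDB arities are assumptions of this sketch.

```python
def arity_of_annotated(mu, edb_arity):
    """Arity of P^mu: sum over the positions of P of q_l, where q_l is the
    arity of R for an annotation (R, i) and the sum of the arities of R and R'
    for an annotation (R, i, R', i', j)."""
    total = 0
    for annotation in mu:
        if len(annotation) == 2:
            R, _i = annotation
            total += edb_arity[R]
        elif len(annotation) == 5:
            R, _i, R2, _i2, _j = annotation
            total += edb_arity[R] + edb_arity[R2]
        else:
            raise ValueError("annotation must have 2 or 5 components")
    return total


# The annotation from the example: mu(1) = (R,1), mu(2) = (R,2,R,1,1), R binary.
mu_example = [("R", 1), ("R", 2, "R", 1, 1)]
example_arity = arity_of_annotated(mu_example, {"R": 2})
```

For the example annotation this gives arity 6, matching the combined length of $\mathbf{u}_{x_{1}}=x_{1}x_{2}$ and $\mathbf{u}_{x_{3}}=x_{1}x_{2}u_{1}u_{2}$ .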

We have to show that $I\models\Gamma^{\prime}$ iff $I^{g}\models\Gamma$ for any $\mathbf{S}_{E}$ -instance $I$ . There is a correspondence between extensions of $I^{g}$ to the IDB relations in $\Gamma$ and extensions of $I$ to the IDB relations in $\Gamma^{\prime}$ . More precisely, a fact $P^{\mu}(\mathbf{a})$ in an extension of $I$ represents a fact $P(\mathbf{b})$ in an extension of $I^{g}$ as follows (and vice versa): for each $i\in{\mathtt{pos}}(P)$ , let $\mathbf{a}_{i}$ be the subtuple of $\mathbf{a}$ that starts at position $\sum_{\ell=1..i-1}q_{\ell}$ and is of length $q_{i}$ (where, as before, $q_{\ell}$ is the arity of $R$ if $\mu(\ell)=(R,i)$ and $q_{\ell}$ is the sum of the arities of $R$ and $R^{\prime}$ if $\mu(\ell)=(R,i,R^{\prime},i^{\prime},j)$ ); the $i$ -th constant in $\mathbf{b}$ is $b_{R(\mathbf{a}_{i}),j}$ if $\mu(i)=(R,j)$ and $b_{R(\mathbf{c}),j,R^{\prime}(\mathbf{c}^{\prime}),j^{\prime},\ell}$ if $\mu(i)=(R,j,R^{\prime},j^{\prime},\ell)$ and $\mathbf{a}_{i}=\mathbf{c}\mathbf{c}^{\prime}$ .

One essentially has to show that every application of a rule from $\Gamma^{\prime}$ in an extension of $I$ can be reproduced by an application of a rule from $\Gamma$ in the corresponding extension of $I^{g}$ , and vice versa. We only sketch the details. First let $J^{g}$ be an extension of $I^{g}$ to the IDB relations in $\Gamma$ and let $P(\mathbf{y})\leftarrow q(\mathbf{x})$ be a rule in $\Gamma$ applicable in $J^{g}$ , and $h$ a homomorphism from $q(\mathbf{x})$ to $J^{g}$ such that $P(h(\mathbf{y}))\notin J^{g}$ . Since $\Gamma$ is closed under contractions of rules, we can assume that $h$ is injective. Let

[TABLE]

such that all $P_{i}$ are IDB, all $R_{i}$ EDB, and an equality atom ${\mathtt{eq}}(x,y)$ is included in the third line if and only if at least one of $h(x)$ and $h(y)$ is not of the form $b_{p}$ . Consequently, for all variables $x$ that occur in the second line, $h(x)$ is of the form $b_{p}$ . One can now verify that Condition ( $*$ ) is satisfied. Assume that this is not the case. The first case is that there are distinct atoms $R_{i}(\mathbf{y}_{i})$ and $R_{j}(\mathbf{y}_{j})$ that share a variable $x$ . In $I^{g}$ , every constant of the form $b_{p}$ occurs in exactly one fact that only contains constants of the form $b_{p}$ . Thus, $h$ must take $R_{i}(\mathbf{y}_{i})$ and $R_{j}(\mathbf{y}_{j})$ to the same fact in $J^{g}$ . Since $h$ is injective, $R_{i}(\mathbf{y}_{i})$ and $R_{j}(\mathbf{y}_{j})$ must be identical which is a contradiction. The second case is that there is an atom $R_{i}(\mathbf{y}_{i})$ in which a variable occurs more than once. This is in contradiction to $h$ being a homomorphism to $J^{g}$ .

Now define a map $\delta:\mathbf{x}\rightarrow\Omega$ by putting $\delta(x)=p$ if $h(x)=b_{p}$ and $\delta(x)=(p,p^{\prime},i)$ if $h(x)=b_{p,p^{\prime},i}$ . It can be verified that the two conditions required of $\delta$ are satisfied. We thus obtain a corresponding rule in $\Gamma^{\prime}$ . It can be verified that applying this rule in the extension $J$ of $I$ corresponding to $J^{g}$ adds the fact that corresponds to $P(h(\mathbf{y}))$ .

Conversely, let $J$ be an extension of $I$ to the IDB relations in $\Gamma^{\prime}$ and let

[TABLE]

be a rule in $\Gamma^{\prime}$ and $h$ a homomorphism from the rule body to $J$ such that $P^{\mu_{0}}(h(\mathbf{x}^{\prime}_{0}))$ is not in $J$ . This rule was derived from a rule

[TABLE]

in $\Gamma$ and a map $\delta:\mathbf{x}\rightarrow\Omega$ , $\mathbf{x}$ the variables in the latter rule. We define a map $h^{\prime}$ from $\mathbf{x}$ to $\mathsf{dom}(J^{g})$ , where $J^{g}$ is the extension of $I^{g}$ that corresponds to $J$ . Let $x\in\mathbf{x}$ . If $\delta(x)=(R,i)$ and $h(\mathbf{u}_{x})=\mathbf{a}$ , then set $h^{\prime}(x)=b_{R(\mathbf{a}),i}$ . If $\delta(x)=(R,i,R^{\prime},i^{\prime},j)$ and $h(\mathbf{u}_{x})=\mathbf{a}\mathbf{a}^{\prime}$ , then set $h^{\prime}(x)=b_{R(\mathbf{a}),i,R^{\prime}(\mathbf{a}^{\prime}),i^{\prime},j}$ . We argue that $h^{\prime}$ is a homomorphism from the body of the latter rule to $J^{g}$ . There are three cases:

•

Consider an atom $P_{i}(\mathbf{x}_{i})$ . Let $\mathbf{x}_{i}=x_{1}\cdots x_{n}$ . Then there is a corresponding atom $P_{i}^{\mu_{i}}(\mathbf{x}^{\prime}_{i})$ in the former rule and thus $P_{i}^{\mu_{i}}(h(\mathbf{x}^{\prime}_{i}))\in J$ . For each $j\in{\mathtt{pos}}(P_{i})$ , let $\mathbf{a}_{j}$ be the subtuple of $h(\mathbf{x}^{\prime}_{i})$ that starts at position $\sum_{\ell=1..j-1}q_{\ell}$ and is of length $q_{j}$ . Define the tuple $\mathbf{b}$ by letting the $j$ -th constant be $b_{R(\mathbf{a}_{j}),\ell}$ if $\mu_{i}(j)=(R,\ell)$ and $b_{R(\mathbf{c}),\ell,R^{\prime}(\mathbf{c}^{\prime}),\ell^{\prime},k}$ if $\mu_{i}(j)=(R,\ell,R^{\prime},\ell^{\prime},k)$ and $\mathbf{a}_{j}=\mathbf{c}\mathbf{c}^{\prime}$ . By (R3), all constants in $\mathbf{b}$ occur in the domain of $J^{g}$ . Moreover, $P_{i}(\mathbf{b})\in J^{g}$ . It thus remains to observe that $h^{\prime}(\mathbf{x}_{i})=\mathbf{b}$ , which follows from (R1) and (R2) and the definition of $h^{\prime}$ .

•

Consider an atom $R_{i}(\mathbf{y}_{i})$ . Let $\mathbf{y}_{i}=y_{1}\cdots y_{n}$ . Then the atom $R_{i}(\mathbf{y}_{i})$ must also be in the former rule and thus $R_{i}(h(\mathbf{y}_{i}))\in J$ , yielding $R_{i}(b_{R_{i}(h(\mathbf{y}_{i})),1},\dots,b_{R_{i}(h(\mathbf{y}_{i})),n})\in J^{g}$ . By Condition 1 imposed on $\delta$ , we have $\delta(y_{j})=(R_{i},j)$ for each $j$ . Moreover, by (I1) we must have $\mathbf{u}_{y_{j}}=\mathbf{y}_{i}$ for each $j$ . Thus, the definition of $h^{\prime}$ yields $h^{\prime}(\mathbf{y}_{i})=b_{R_{i}(h(\mathbf{y}_{i})),1}\cdots b_{R_{i}(h(\mathbf{y}_{i})),n}$ and we are done.

•

Consider an atom ${\mathtt{eq}}(z_{i,1},z_{i,2})$ . We know that one of the Cases 2a to 2c applies to ${\mathtt{eq}}(z_{i,1},z_{i,2})$ . We only treat the first case explicitly. Thus assume that $\delta(z_{i,1})=(R,j)$ and $\delta(z_{i,2})=(R,j,R^{\prime},j^{\prime},1)$ . By definition, $h^{\prime}(z_{i,1})=b_{R(h(\mathbf{u}_{z_{i,1}})),j}$ and $h^{\prime}(z_{i,2})=b_{R(\mathbf{c}),j,R^{\prime}(\mathbf{c}^{\prime}),j^{\prime},1}$ where $h(\mathbf{u}_{z_{i,2}})=\mathbf{c}\mathbf{c}^{\prime}$ . By (I2), $\mathbf{u}_{z_{i,1}}$ is identical to the first variable block in $\mathbf{u}_{z_{i,2}}$ and thus $h(\mathbf{u}_{z_{i,1}})=\mathbf{c}$ . By definition of $I^{g}$ , $J^{g}$ contains ${\mathtt{eq}}(b_{R(\mathbf{c}),j},b_{R(\mathbf{c}),j,R^{\prime}(\mathbf{c}^{\prime}),j^{\prime},1})$ and we are done.

It can now be verified that the application of the latter rule adds to $J^{g}$ the fact that corresponds to $P^{\mu_{0}}(h(\mathbf{x}^{\prime}_{0}))$ .

DLog-rewritability of CSPs is NP-complete [Bar16, CL17] and it was observed in [BtCLW14] that this result lifts to generalized CSPs. It thus follows from Theorems 1 and 1 and Lemma 12 that DLog-rewritability of Boolean MDDLog programs that have equality is decidable in 2NExpTime. It is straightforward to verify that the 2NExpTime lower bound for Datalog-rewritability of MDDLog programs from [BL16] applies also to programs that have equality.

Theorem 13.

For Boolean MDDLog programs that have equality, Datalog-rewritability is 2NExpTime-complete.

Regarding MDDLog programs that do not have equality, the above yields a sound but possibly incomplete algorithm for deciding DLog-rewritability. Let us make this more precise. For an MDDLog program $\Pi$ that does not have equality, we use $\Pi^{=}$ to denote the extension of $\Pi$ with the fresh EDB relation ${\mathtt{eq}}$ and the above rules. If $\Pi$ has equality, then $\Pi^{=}$ simply denotes $\Pi$ .

Lemma 14.

For MDDLog programs $\Pi$ , DLog-rewritability of $\Pi^{=}$ implies DLog-rewritability of $\Pi$ .

Lemma 14 follows from the trivial observation that any DLog-rewriting of $\Pi^{=}$ can be converted into a DLog-rewriting of $\Pi$ by dropping all rules that use the relation ${\mathtt{eq}}$ . It is an interesting open question whether the converse of Lemma 14 holds. Due to Lemma 14, a sound but possibly incomplete algorithm for unrestricted MDDLog programs $\Pi$ can thus be formulated as follows: first replace $\Pi$ with $\Pi^{=}$ and then decide DLog-rewritability as per Theorem 13. We speculate that this algorithm is actually complete. In particular, for CSPs it is known that adding equality does preserve Datalog-rewritability [LZ07], and completeness of our algorithm is equivalent to an analogous result holding for MDDLog.

6.2. Canonical Datalog-Rewritings

For constructing actual DLog-rewritings instead of only deciding their existence, canonical Datalog programs play an important role. Feder and Vardi show that for every CSP template $T$ and all $\ell,k>0$ , one can construct an $(\ell,k)$ -Datalog program that is canonical for $T$ in the sense that if there is any $(\ell,k)$ -Datalog program which is equivalent to the complement of $T$ , then the canonical one is [FV98]. In this section, we show that there are similarly simple canonical Datalog programs for Boolean MDDLog. Note that the existence of canonical Datalog programs for MMSNP (and thus for Boolean MDDLog) is already known from [BD13]. However, the construction given there is more general and rather complex, proceeding via an infinite template and exploiting that it is $\omega$ -categorial. This makes it hard to analyze the exact structure and size of the resulting canonical programs. Here, we define canonical Datalog programs for Boolean MDDLog programs in a more elementary way. In contrast to the previous subsection, we do not assume that equality is available.

Let $\Pi$ be a Boolean MDDLog program over EDB schema $\mathbf{S}_{E}$ and with IDB relations from $\mathbf{S}_{I}$ . Further let $0\leq\ell<k$ . We aim to construct a canonical $(\ell,k)$ -DLog program for $\Pi$ . The most important properties of this program are that it is sound for $\Pi$ and complete for $\Pi$ on $\mathbf{S}_{E}$ -instances of treewidth $(\ell,k)$ . We first convert $\Pi$ into a DDLog program $\Pi^{\prime}$ that is equivalent to $\Pi$ on instances of treewidth $(\ell,k)$ and then construct the canonical program for $\Pi^{\prime}$ rather than for $\Pi$ . Unlike $\Pi$ , the new program $\Pi^{\prime}$ is not monadic. Informally, the canonical program simulates $\Pi$ on $\mathbf{S}_{E}$ -instances of treewidth $(\ell,k)$ , proceeding in a bag-by-bag fashion. This is enabled by the additional non-monadic IDB relations introduced in $\Pi^{\prime}$ , which represent information that needs to be passed from bag to bag. We remark that the construction of $\Pi^{\prime}$ is vaguely similar in spirit to the first step of converting an MDDLog program into simple form, cf. Appendix A.1. To describe it, we need a preliminary.

With every MDDLog rule $p(\mathbf{y})\leftarrow q(\mathbf{x})$ where $q(\mathbf{x})$ is of treewidth $(\ell,k)$ and every $(\ell,k)$ -tree decomposition $(T,(B_{v})_{v\in V})$ of $q(\mathbf{x})$ , we associate a set of rewritten rules constructed as follows. Choose a root $v_{0}$ of the undirected tree $T$ , thus inducing a direction. We write $v\prec v^{\prime}$ if $v^{\prime}$ is a successor of $v$ in $T$ and use $\mathbf{x}_{v^{\prime}}$ to denote $B_{v}\cap B_{v^{\prime}}$ . For all $v\in V\setminus\{v_{0}\}$ such that $|\mathbf{x}_{v}|=m$ , introduce a fresh $m$ -ary IDB relation $Q_{v}$ ; note that $m\leq\ell$ . Now, the set of rewritten rules contains one rule for each $v\in V$ . For $v\neq v_{0}$ , the rule is

[TABLE]

where $p_{v}(\mathbf{y}_{v})$ is the sub-disjunction of $p(\mathbf{y})$ that contains all disjuncts $P(\mathbf{z})$ with $\mathbf{z}\subseteq B_{v}$ and $q(\mathbf{x})|_{B_{v}}$ is the restriction of $q$ to the atoms that contain only variables from $B_{v}$ . For $v_{0}$ , we include the same rule, but use only $p_{v}(\mathbf{y}_{v})$ as the head. The set of rewritten rules associated with $p(\mathbf{y})\leftarrow q(\mathbf{x})$ is obtained by taking the union of the rewritten rules associated with $p(\mathbf{y})\leftarrow q(\mathbf{x})$ and any $(T,(B_{v})_{v\in V})$ .
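The rewriting along a tree decomposition can be illustrated by a small Python sketch. It is not taken from the paper: the encoding of atoms as pairs, the function name `rewritten_rules`, the relation names `Q_<node>`, and the sample rule are all assumptions made for illustration. In particular, the sketch assumes that the rule for a non-root node $v$ has head $p_{v}(\mathbf{y}_{v})\vee Q_{v}(\mathbf{x}_{v})$ and body $q(\mathbf{x})|_{B_{v}}$ together with one atom $Q_{v^{\prime}}(\mathbf{x}_{v^{\prime}})$ per successor $v^{\prime}$ of $v$ , which is one reading of the description above.

```python
# Atoms are (relation_name, tuple_of_variables); a rule is (head_atoms, body_atoms).
# Assumed rule shape for a non-root node v (an interpretation, not quoted from the text):
#   p_v(y_v) or Q_v(x_v)  <-  q|_{B_v}  and  Q_{v'}(x_{v'}) for every successor v' of v

def rewritten_rules(head, body, bags, children, root):
    """Split one rule along a rooted tree decomposition of its body."""
    rules = []

    def visit(v, parent):
        bag = bags[v]
        # Head disjuncts and body atoms that only use variables of the bag B_v.
        local_head = [a for a in head if set(a[1]) <= bag]
        local_body = [a for a in body if set(a[1]) <= bag]
        for child in children.get(v, []):
            # x_{child} = B_v intersected with B_{child}, in a fixed order.
            shared = tuple(sorted(bag & bags[child]))
            local_body.append((f"Q_{child}", shared))
            visit(child, v)
        if parent is not None:
            # Non-root nodes pass information upwards via the fresh relation Q_v.
            shared = tuple(sorted(bags[parent] & bag))
            local_head.append((f"Q_{v}", shared))
        rules.append((local_head, local_body))

    visit(root, None)
    return rules


# Hypothetical rule P(x) <- R(x,y1), R(x,y2), S(y1,z), S(y2,z) with a two-node decomposition.
head = [("P", ("x",))]
body = [("R", ("x", "y1")), ("R", ("x", "y2")), ("S", ("y1", "z")), ("S", ("y2", "z"))]
bags = {"v": {"x", "y1", "y2"}, "w": {"y1", "y2", "z"}}
children = {"v": ["w"]}

for h, b in rewritten_rules(head, body, bags, children, "v"):
    print(h, "<-", b)
```

The fresh relation for a node has arity $|B_{v}\cap B_{v^{\prime}}|$ , which is at most $\ell$ for an $(\ell,k)$ -tree decomposition, in line with the remark that $m\leq\ell$ .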

The DDLog program $\Pi^{\prime}$ is constructed from $\Pi$ as follows:

(1)

first extend $\Pi$ with all contractions of rules in $\Pi$ ;

(2)

then delete all rules with $q(\mathbf{x})$ not of treewidth $(\ell,k)$ and replace every rule $p(\mathbf{y})\leftarrow q(\mathbf{x})$ with $q(\mathbf{x})$ of treewidth $(\ell,k)$ with the rewritten rules associated with it.
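Step (1) can be made concrete with the following Python sketch, which enumerates the contractions of a single rule. It assumes that a contraction is obtained by identifying variables of the rule, so that contractions correspond to set partitions of the rule's variables; the function names and the pair encoding of atoms are hypothetical.

```python
def partitions(items):
    """Yield all set partitions of a list, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into one of the existing blocks ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def contractions(head, body):
    """Return all rules obtained from (head, body) by identifying variables."""
    variables = sorted({v for _, args in head + body for v in args})
    result = []
    for part in partitions(variables):
        # Each variable is replaced by the first variable of its block.
        rep = {v: block[0] for block in part for v in block}

        def rename(atoms):
            renamed = ((rel, tuple(rep[v] for v in args)) for rel, args in atoms)
            return tuple(dict.fromkeys(renamed))  # drop duplicate atoms, keep order

        rule = (rename(head), rename(body))
        if rule not in result:
            result.append(rule)
    return result


# Hypothetical rule P(x) <- R(x, y): its contractions are the rule itself and P(x) <- R(x, x).
for rule in contractions([("P", ("x",))], [("R", ("x", "y"))]):
    print(rule)
```

The number of contractions of a rule is bounded by the number of set partitions of its variables, so closing a program under contractions can increase its size exponentially in the rule size.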

To clarify the relation between $\Pi$ and $\Pi^{\prime}$ , we remark that it is possible to verify the following conditions; a detailed proof is omitted since these conditions are not going to be used in what follows:

(I)

$\Pi^{\prime}$ is sound for $\Pi$ , that is, for all $\mathbf{S}_{E}$ -instances $I$ , $I\models\Pi^{\prime}$ implies $I\models\Pi$ ;

(II)

$\Pi^{\prime}$ is complete for $\Pi$ on $\mathbf{S}_{E}$ -instances of treewidth $(\ell,k)$ , that is, for all such instances $I$ , $I\models\Pi$ implies $I\models\Pi^{\prime}$ .

Note that $\Pi^{\prime}$ is not complete for $\Pi$ on instances of unrestricted treewidth. For example, if $\Pi$ consists of only a goal rule whose rule body is a $(k+1)$ -clique (without reflexive loops), then $\Pi^{\prime}$ returns false on the instance that consists of the same clique.

Example.

Assume that $\Pi$ contains the rule

[TABLE]

and consider the $(2,3)$ -tree decomposition of the rule body that consists of two nodes $v,v^{\prime}$ , $v^{\prime}$ successor of $v$ , with $B_{v}=\{x,y_{1},y_{2}\}$ and $B_{v^{\prime}}=\{y_{1},y_{2},z\}$ . In $\Pi^{\prime}$ , the rule is split into two rules

[TABLE]

Informally, these rules are supposed to cover homomorphisms from the body of the original rule to an $\mathbf{S}^{\prime}_{E}$ -instance of treewidth $(\ell,k)$ such that the variables in $B_{v^{\prime}}$ are mapped to constants from some bag and variables from $B_{v}$ to constants from a neighboring bag. The IDB relation $Q_{v^{\prime}}$ memorizes that we have already seen part of the rule body.

Let $\mathbf{S}^{\prime}_{I}$ denote the additional IDB relations in $\Pi^{\prime}$ . We now construct the canonical $(\ell,k)$ -DLog program $\Gamma^{c}$ for $\Pi$ . Fix constants $a_{1},\dots,a_{\ell}$ . For $\ell^{\prime}\leq\ell$ , we use $\mathfrak{I}_{\ell^{\prime}}$ to denote the set of all $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ -instances with domain $\mathbf{a}_{\ell^{\prime}}:=a_{1},\dots,a_{\ell^{\prime}}$ . The program uses $\ell^{\prime}$ -ary IDB relations $P_{M}$ , for all $\ell^{\prime}\leq\ell$ and all $M\subseteq\mathfrak{I}_{\ell^{\prime}}$ . It contains all rules $q(\mathbf{x})\rightarrow P_{M}(\mathbf{y})$ , $M\subseteq\mathfrak{I}_{\ell^{\prime}}$ , that satisfy the following conditions:

(1)

$q(\mathbf{x})$ is over schema $\mathbf{S}_{E}\cup\{P_{M}\mid M\subseteq\mathfrak{I}_{\ell^{\prime}},~{}\ell^{\prime}\leq\ell\}$ and contains at most $k$ variables;

(2)

for every extension $J$ of the $\mathbf{S}_{E}$ -instance $I_{q}|_{\mathbf{S}_{E}}$ with $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ -facts such that

(a)

$J$ satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}()$ and

(b)

for each $P_{N}(\mathbf{z})\in q$ , $N\subseteq\mathfrak{I}_{\ell^{\prime\prime}}$ , there is an $L\in N$ such that $L[\mathbf{z}/\mathbf{a}_{\ell^{\prime\prime}}]=J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{z}}$

there is an $L\in M$ such that $L[\mathbf{y}/\mathbf{a}_{\ell^{\prime}}]=J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{y}}$

where $I_{q}$ is $q$ viewed as an instance, $L[\mathbf{x}/\mathbf{a}]$ denotes the result of replacing the constants in $\mathbf{a}$ with the variables in $\mathbf{x}$ (possibly resulting in identifications), and $J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{x}}$ denotes the simultaneous restriction of $J$ to schema $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ and constants $\mathbf{x}$ . (Footnote 3: We could additionally demand that $M$ is minimal so that Condition 2 is satisfied, but this is not strictly required.) We also include in $\Gamma^{c}$ all rules of the form $P_{\emptyset}(\mathbf{x})\rightarrow{\mathtt{goal}}()$ , $P_{\emptyset}$ of any arity from $0$ to $\ell$ .

The intuition behind the construction of $\Gamma^{c}$ is as follows. When starting with an input $\mathbf{S}_{E}$ -instance $I$ of treewidth $(\ell,k)$ and then chasing with $\Gamma^{c}$ , that is, exhaustively applying these rules in an unspecified order, then the resulting instance $I^{\prime}$ represents all extensions $J$ of $I$ to the relations in $\mathbf{S}_{I}\cup\mathbf{S}_{I}^{\prime}$ that satisfy all rules in $\Pi^{\prime}$ and do not contain ${\mathtt{goal}}()$ . A fact $P_{M}(\mathbf{a})\in I^{\prime}$ , $M\subseteq\mathfrak{I}_{\ell^{\prime\prime}}$ , means that for every such $J$ there is an $L\in M$ such that $J$ contains the facts in $L[\mathbf{a}/\mathbf{a}_{\ell^{\prime\prime}}]$ . Thus, the set $M$ in the index of $P_{M}$ should be read disjunctively. Note that $P_{\emptyset}(\mathbf{a})\in I^{\prime}$ then indicates that every extension of $I$ that satisfies all rules in $\Pi^{\prime}$ must contain ${\mathtt{goal}}()$ . The bodies of rules in $\Gamma^{c}$ are large enough to cover the restriction of $I$ to the constants from any single bag. This suffices only because we have transitioned from $\Pi$ to $\Pi^{\prime}$ before constructing $\Gamma^{c}$ .

The following are central properties of canonical DLog programs.

Lemma 15.

(1)

$\Gamma^{c}$ is sound for $\Pi$ ;

(2)

$\Gamma^{c}$ is complete for $\Pi$ on instances of treewidth $(\ell,k)$ .

Proof 6.3.

For Point 1, let $I$ be an $\mathbf{S}_{E}$ -instance with $I\models\Gamma^{c}$ . It suffices to show that $I\models\Pi^{\prime}$ . Let $I=I_{1},I_{2},\dots$ be the sequence of $\mathbf{S}_{E}\cup\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ -instances obtained by chasing $I$ with $\Gamma^{c}$ . We first note that the following can be proved by induction on $i$ (and using the definition of $\Gamma^{c}$ ):

Claim. If $P_{M}(\mathbf{b})\in I_{i}$ , $M\subseteq\mathfrak{I}_{\ell^{\prime}}$ , then for every extension $J$ of $I$ to the relations in $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ that satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}()$ , there is an $L\in M$ such that $L[\mathbf{b}/\mathbf{a}_{\ell^{\prime}}]=J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{b}}$ .

Since $I\models\Gamma^{c}$ , there are $i>0$ and $\mathbf{b}\subseteq\mathsf{dom}(I)$ such that $P_{\emptyset}(\mathbf{b})\in I_{i}$ . By the claim, there is thus no extension $J$ of $I$ to the relations in $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ that satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}()$ . Consequently, $I\models\Pi^{\prime}$ .

For Point 2, assume that $I\not\models\Gamma^{c}$ and let $(T,(B_{v})_{v\in V})$ be an $(\ell,k)$ -tree decomposition of $I$ , $T=(V,E)$ . Then there is an extension $J$ of $I$ to the IDB relations in $\Gamma^{c}$ such that all rules in $\Gamma^{c}$ are satisfied and $J$ contains no atom of the form $P_{\emptyset}(\mathbf{b})$ .

We use $J$ to construct an extension $J^{\prime}$ of $I$ to the relations in $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ . Choose a root $v_{0}$ of $T$ , thus inducing a direction on the undirected tree $T$ . For all $v\in V$ and successors $v^{\prime}$ of $v$ , choose an ordering $\mathbf{c}_{v,v^{\prime}}$ of the constants in $B_{v}\cap B_{v^{\prime}}$ and let ${\ell_{v,v^{\prime}}}$ denote the number of these constants. Let $P_{M_{1}}(\mathbf{c}_{v,v^{\prime}}),\dots,P_{M_{r}}(\mathbf{c}_{v,v^{\prime}})$ be all facts of this form in $J$ . By construction of $\Gamma^{c}$ , there must be at least one such fact, and the fact $P_{M_{1}\cap\cdots\cap M_{r}}(\mathbf{c}_{v,v^{\prime}})$ must also be in $J$ . Thus, we can associate with $v,v^{\prime}$ a unique minimal set $M_{v,v^{\prime}}$ so that $P_{M_{v,v^{\prime}}}(\mathbf{c}_{v,v^{\prime}})\in J$ .

The construction of $J^{\prime}$ proceeds top down over $T$ . At all points, we maintain the invariant that

( $*$ )

for all nodes $v\in V$ and successors $v^{\prime}$ of $v$ , there is an $L\in M_{v,v^{\prime}}$ such that $L[\mathbf{c}_{v,v^{\prime}}/\mathbf{a}_{\ell_{v,v^{\prime}}}]=J^{\prime}|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{c}_{v,v^{\prime}}}$ .

The construction of $J^{\prime}$ starts at the root $v_{0}$ of $T$ . There must be an extension $J_{v_{0}}$ of $I|_{B_{v_{0}}}$ with $S_{I}\cup S^{\prime}_{I}$ -facts such that

(i)

$J_{v_{0}}$ satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}()$ and

(ii)

for each $P_{M}(\mathbf{b})\in J|_{B_{v_{0}}}$ , $M\subseteq\mathfrak{I}_{\ell^{\prime}}$ , there is an $L\in M$ such that $L[\mathbf{b}/\mathbf{a}_{\ell^{\prime}}]=J_{v_{0}}|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{b}}$

as, otherwise, a rule of $\Gamma^{c}$ would create an atom of the form $P_{\emptyset}(\mathbf{c})$ in $J$ . Start with putting $J^{\prime}=I\cup J_{v_{0}}$ . Note that for each successor $v$ of $v_{0}$ , ( $*$ ) is satisfied because of Point (ii) and since $P_{M_{v_{0},v}}(\mathbf{c}_{v_{0},v})\in J|_{B_{v_{0}}}$ .

We proceed top-down over $T$ . Assume that $v^{\prime}$ is a successor of $v$ and $B_{v}$ has already been treated. There must be an extension $J_{v^{\prime}}$ of $I|_{B_{v^{\prime}}}$ with $S_{I}\cup S^{\prime}_{I}$ -facts such that

(i)

$J_{v^{\prime}}$ satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}()$ ,

(ii)

$J_{v^{\prime}}|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{c}_{v,v^{\prime}}}=J^{\prime}|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{c}_{v,v^{\prime}}}$ , and

(iii)

for each $P_{M}(\mathbf{b})\in J|_{B_{v^{\prime}}}$ , $M\subseteq\mathfrak{I}_{\ell^{\prime}}$ , there is an $L\in M$ such that $L[\mathbf{b}/\mathbf{a}_{\ell^{\prime}}]=J_{v^{\prime}}|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I},\mathbf{b}}$

as, otherwise, because of ( $*$ ) a rule of $\Gamma^{c}$ would create an atom of the form $P_{M}(\mathbf{c}_{v,v^{\prime}})$ in $J$ with $M\subsetneq M_{v,v^{\prime}}$ , in contradiction to $M_{v,v^{\prime}}$ being minimal with $P_{M_{v,v^{\prime}}}(\mathbf{c}_{v,v^{\prime}})\in J$ . Put $J^{\prime}=J^{\prime}\cup J_{v^{\prime}}$ . It can again be verified that ( $*$ ) is satisfied.

By construction, the instance $J^{\prime}$ does not contain ${\mathtt{goal}}()$ and $(T,(B_{v})_{v\in V})$ is also a tree decomposition of $J^{\prime}$ , that is, each EDB atom and each IDB atom of $J^{\prime}$ falls within some bag $B_{v}$ . We aim to show that $J^{\prime}$ satisfies all rules of $\Pi$ , thus $I\not\models\Pi$ as required.

Let $\Pi_{0}$ be the result of closing $\Pi$ under contractions of rules and recall that $\Pi^{\prime}$ is obtained from $\Pi_{0}$ by dropping and rewriting rules. Let $\rho$ be a rule in $\Pi$ and let $h$ be a homomorphism from its body to $J^{\prime}$ . We have to show that one of the disjuncts in the head of $\rho$ is satisfied under $h$ . $\Pi_{0}$ contains the rule $\rho_{0}$ obtained from $\rho$ by identifying all variables $x,y$ such that $h(x)=h(y)$ . It clearly suffices to show that one of the disjuncts in the head of $\rho_{0}$ is satisfied under $h$ . Note that $h$ is an injective homomorphism from the body $q(\mathbf{x})$ of $\rho_{0}$ to $J^{\prime}$ which implies that $q(\mathbf{x})$ is of treewidth $(\ell,k)$ . Moreover, we can read off an $(\ell,k)$ -tree decomposition $(T^{\prime},(B^{\prime}_{v})_{v\in V^{\prime}})$ of $q(\mathbf{x})$ from $h$ and $(T,(B_{v})_{v\in V})$ .

In $\Pi^{\prime}$ , $\rho_{0}$ and $(T^{\prime},(B^{\prime}_{v})_{v\in V^{\prime}})$ are rewritten into rules $\rho_{1},\dots,\rho_{m}$ such that no $\rho_{i}$ uses a fresh IDB relation from the head of any $\rho_{j}$ with $j\geq i$ (that is, an IDB relation that does not occur in $\Pi_{0}$ , of arity at most $\ell$ ). Let $\rho_{i}$ be $q_{i}(\mathbf{x}_{i})\rightarrow R_{i,1}(\mathbf{x}_{i,1})\vee\cdots\vee R_{i,n_{i}}(\mathbf{x}_{i,n_{i}})\vee Q_{i}(\mathbf{z}_{i})$ where $R_{i,1}(\mathbf{x}_{i,1}),\dots,R_{i,n_{i}}(\mathbf{x}_{i,n_{i}})$ are disjuncts that also occur in the head of $\rho_{0}$ and $Q_{i}$ is a fresh IDB relation introduced by the rewriting in the case that $i<m$ and ${\mathtt{false}}$ if $i=m$ (by which we mean: there is no $Q_{i}(\mathbf{z}_{i})$ disjunct in the latter case). One can show by induction on $i$ that for $1\leq i\leq m$ ,

(1)

$R_{j,t}(h(\mathbf{x}_{j,t}))\in J^{\prime}$ for some $j\leq i$ and $t\in\{1,\dots,n_{j}\}$ or

(2)

$Q_{i}(h(\mathbf{z}_{i}))\in J^{\prime}$ .

To see this, assume that Point 1 is not satisfied for some $i$ . Then Point 2 holds for all $j<i$ . By choice of $\rho_{i}$ , there is a $v\in V$ such that $h(\mathbf{x}_{i})\subseteq B_{v}$ . Thus $h$ is a homomorphism from $q_{i}(\mathbf{x}_{i})$ to $J_{v}$ , and consequently there is a disjunct $R(\mathbf{z})$ in the head of $\rho_{i}$ such that $R(h(\mathbf{z}))\in J_{v}\subseteq J^{\prime}$ . This implies that one of Points 1 or 2 is satisfied for $i$ .

Note that Point 2 cannot hold for $i=m$ because the $Q_{m}$ disjunct is not present in $\rho_{m}$ . Thus there is an $i\leq m$ such that $R_{i,j}(h(\mathbf{x}_{i,j}))\in J^{\prime}$ for some $j$ . Since $R_{i,j}(\mathbf{x}_{i,j})$ occurs in the head of $\rho_{0}$ , we are done.

We are now ready to show that the canonical program is indeed canonical, as detailed by the following theorem. For two Boolean DLog programs $\Pi_{1},\Pi_{2}$ over the same EDB schema $\mathbf{S}_{E}$ , we write $\Pi_{1}\subseteq\Pi_{2}$ if for every $\mathbf{S}_{E}$ -instance $I$ , $I\models\Pi_{1}$ implies $I\models\Pi_{2}$ .

Theorem 16.

Let $\Pi$ be a Boolean MDDLog program, $0\leq\ell\leq k$ , and $\Gamma^{c}$ the canonical $(\ell,k)$ -DLog program for $\Pi$ . Then

(1)

$\Gamma\subseteq\Gamma^{c}$ for every $(\ell,k)$ -DLog program $\Gamma$ that is sound for $\Pi$ ;

(2)

$\Pi$ is $(\ell,k)$ -DLog-rewritable iff $\Gamma^{c}$ is a DLog-rewriting of $\Pi$ .

Proof 6.4.

Let $\mathbf{S}_{E}$ be the EDB schema of $\Pi$ .

For Point 1, let $\Gamma$ be an $(\ell,k)$ -DLog program that is sound for $\Pi$ and let $I$ be an $\mathbf{S}_{E}$ -instance with $I\models\Gamma$ . From the proof tree for ${\mathtt{goal}}()$ from $I$ and $\Gamma$ , we can construct an $\mathbf{S}_{E}$ -instance $J$ of treewidth $(\ell,k)$ such that $J\models\Gamma$ and $J\rightarrow I$ . It suffices to show that $J\models\Gamma^{c}$ , which is easy: from $J\models\Gamma$ , we obtain $J\models\Pi$ and Point 2 of Lemma 15 yields $J\models\Gamma^{c}$ .

The “if” direction of Point 2 is trivial. For the “only if” direction, assume that $\Pi$ is $(\ell,k)$ -DLog-rewritable and let $\Gamma$ be a concrete rewriting. We have to show that $\Gamma^{c}$ is sound and complete for $\Pi$ . The former is Point 1 of Lemma 15. For the latter, we get $\Pi\subseteq\Gamma$ since $\Gamma$ is a rewriting of $\Pi$ and $\Gamma\subseteq\Gamma^{c}$ from Point 1, thus $\Pi\subseteq\Gamma^{c}$ as required.

Note that by Point 1 of Theorem 16, the canonical $(\ell,k)$ -DLog program for an MDDLog program $\Pi$ is interesting even if $\Pi$ is not rewritable into an $(\ell,k)$ -DLog program as it is the strongest sound $(\ell,k)$ -DLog approximation of $\Pi$ .

7. Non-Boolean MDDLog Programs

We lift the results about the complexity of rewritability, about canonical DLog programs, and about the shape of rewritings and obstructions from the case of Boolean MDDLog programs to the non-Boolean case. For all of this, a certain extension of $(\ell,k)$ -Datalog programs with parameters plays a central role. We thus begin by introducing these extended programs.

7.1. Deciding Rewritability

An $(\ell,k)$ -Datalog program with $n$ parameters is an $n$ -ary $(\ell+n,k+n)$ -Datalog program in which all IDBs have arity at least $n$ and where in every rule, all IDB atoms agree on the variables used in the last $n$ positions (both in rule bodies and heads and including the ${\mathtt{goal}}$ IDB). The last $n$ positions of IDBs are called parameter positions. To visually separate the parameter positions from the non-distinguished positions, we use “ $|$ ” as a delimiter to replace the usual comma, writing e.g.

[TABLE]

where $P,Q$ are IDB, $R$ is EDB, and there are three parameter positions. Note that, by definition, all variable positions in ${\mathtt{goal}}$ atoms are parameter positions.

Example 7.1.

The following is an MDLog program with one parameter that returns all constants which are on an $R$ -cycle, $R$ a binary EDB relation:

[TABLE]

Parameters in Datalog programs play a similar role as parameters to least fixed-point operators in FO(LFP), see for example [BBV16] and references therein. The program in Example 7.1 is not definable in MDLog without parameters, which shows that adding parameters increases expressive power. Although $(\ell,k)$ -DLog programs with $n$ parameters are $(\ell+n,k+n)$ -DLog programs, one should think of them as a mild generalization of $(\ell,k)$ -programs.
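To make the role of the parameter position tangible, the following Python sketch evaluates one possible MDLog program with one parameter that returns all constants on an $R$ -cycle. The three rules $P(y\,|\,x)\leftarrow R(x,y)$ , $P(z\,|\,x)\leftarrow P(y\,|\,x)\wedge R(y,z)$ , and ${\mathtt{goal}}(x)\leftarrow P(x\,|\,x)$ are an assumption made for this illustration and need not coincide with the program displayed in Example 7.1; the function name `on_cycle` is likewise hypothetical.

```python
def on_cycle(R):
    """Naive bottom-up evaluation of an assumed one-parameter program.

    Assumed rules (the parameter x is kept fixed within every rule):
        P(y | x)  <- R(x, y)
        P(z | x)  <- P(y | x), R(y, z)
        goal(x)   <- P(x | x)
    R is a set of pairs; the result is the set of constants on an R-cycle.
    """
    P = set()  # a fact P(y | x) is stored as the pair (y, x)
    changed = True
    while changed:
        changed = False
        new = {(y, x) for (x, y) in R}
        new |= {(z, x) for (y, x) in P for (y2, z) in R if y2 == y}
        if not new <= P:
            P |= new
            changed = True
    return {x for (y, x) in P if y == x}


# a -> b -> c -> a is a cycle; d is only reachable from it.
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
print(sorted(on_cycle(edges)))
```

The parameter position is what allows the program to remember the starting point $x$ of the path while the monadic part of $P$ walks along $R$ .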

A DLog program is an $\ell$ -DLog program if it is an $(\ell,k)$ -DLog program for some $k$ . To lift decidability and complexity results from the Boolean to the non-Boolean case, we show that rewritability of an $n$ -ary MDDLog program into $\ell$ -DLog with $n$ parameters can be reduced to rewritability of a Boolean MDDLog program into $\ell$ -DLog (without parameters). We believe that Datalog with parameters is a natural rewriting target for non-Boolean MDDLog programs since, in a sense, the $n$ parameters reflect the special role of the constants from the input instance that are returned as an answer. Note that the case $\ell=0$ is about UCQ-rewritability (and thus FO-rewritability) because $0$ -DLog programs (with and without parameters) are an alternative presentation of UCQs. The reduction proceeds in two steps, described by subsequent Lemmas 17 and 18.

Example.

The following MDDLog program is rewritable into the MDLog program with parameters from Example 7.1, but not into an MDLog program without parameters:

[TABLE]

The following lemma shows that, by introducing constants, we can reduce the rewritability of non-Boolean MDDLog programs into Datalog with parameters to the rewritability of Boolean MDDLog programs with constants into Datalog with constants. Note that the presence of constants in an $(\ell,k)$ -DLog program is not reflected in the values of $\ell$ and $k$ . We will show in a second step that the rewritability of Boolean MDDLog programs with constants into Datalog with constants can be reduced to the rewritability of Boolean MDDLog programs without constants into Datalog without constants.

The diameter of an $(\ell,k)$ -DLog program with $n$ parameters is $k$ and the diameter of a DLog program with constants is defined as for DLog programs without constants, that is, only variables contribute to the diameter, but constants do not. The rule size of an MDDLog program is the maximum number of variable occurrences in a rule body.

Lemma 17.

Given an $n$ -ary MDDLog program $\Pi$ , one can construct Boolean MDDLog programs with constants $\Pi_{1},\dots,\Pi_{m}$ over the same EDB schema such that for all $\ell,k$ ,

(1)

$\Pi$ is rewritable into an $(\ell,k)$ -DLog program with $n$ parameters iff each of $\Pi_{1},\dots,\Pi_{m}$ is rewritable into an $(\ell,k)$ -DLog program with constants;

(2)

$m\leq n^{n}$ and the size (resp. diameter, rule size) of each program $\Pi_{i}$ is bounded by the size (resp. diameter, rule size) of $\Pi$ .

The construction takes time polynomial in $|\Pi_{1}\cup\cdots\cup\Pi_{m}|$ .

Proof 7.1.

Let $\Pi$ be an $n$ -ary MDDLog program over EDB schema $\mathbf{S}_{E}$ . Fix a set $C$ of $n$ constants. For each $\mathbf{c}\in C^{n}$ , we construct from $\Pi$ a Boolean MDDLog program $\Pi_{\mathbf{c}}$ such that for any $\ell<k$ , $\Pi$ is $(\ell,k)$ -DLog rewritable iff all programs $\Pi_{\mathbf{c}}$ are.

Let $\mathbf{c}\in C^{n}$ . Given two $n$ -tuples of terms (constants or variables) $\mathbf{s}$ and $\mathbf{t}$ , we write $\mathbf{s}\preceq\mathbf{t}$ if $t_{i}=t_{j}$ implies $s_{i}=s_{j}$ for $1\leq i<j\leq n$ . We write $\mathbf{s}\approx\mathbf{t}$ when $\mathbf{s}\preceq\mathbf{t}\preceq\mathbf{s}$ . The program $\Pi_{\mathbf{c}}$ is obtained from $\Pi$ as follows:

•

replace every rule ${\mathtt{goal}}(\mathbf{x})\leftarrow q(\mathbf{x},\mathbf{y})$ with $\mathbf{c}\preceq\mathbf{x}$ by ${\mathtt{goal}}()\leftarrow q(\mathbf{c},\mathbf{y})$ ;

•

drop every rule ${\mathtt{goal}}(\mathbf{x})\leftarrow q(\mathbf{x},\mathbf{y})$ with $\mathbf{c}\not\preceq\mathbf{x}$ .
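The relation $\preceq$ and the two transformation steps above can be sketched in Python as follows. The encoding of goal rules as pairs of an answer tuple and a body, as well as the names `preceq`, `approx`, and `specialise`, are assumptions of this illustration; non-goal rules are left untouched by the construction and are therefore not modelled.

```python
def preceq(s, t):
    """s ⪯ t: whenever t_i = t_j, also s_i = s_j."""
    n = len(s)
    return all(s[i] == s[j]
               for i in range(n) for j in range(i + 1, n)
               if t[i] == t[j])


def approx(s, t):
    """s ≈ t: both tuples have exactly the same pattern of repetitions."""
    return preceq(s, t) and preceq(t, s)


def specialise(goal_rules, c):
    """Build the goal rules of the Boolean program for the constant tuple c.

    goal_rules is a list of (x, body), standing for goal(x) <- body, where
    body is a list of atoms (relation_name, tuple_of_terms).
    """
    result = []
    for x, body in goal_rules:
        if not preceq(c, x):
            continue  # second bullet: the rule is dropped
        # First bullet: replace the answer variables x by the constants c.
        # This is well defined because c ⪯ x.
        sub = dict(zip(x, c))
        new_body = [(rel, tuple(sub.get(t, t) for t in args)) for rel, args in body]
        result.append(((), new_body))  # the empty head tuple stands for goal()
    return result


# Hypothetical binary program with goal(x1, x2) <- R(x1, x2) and goal(x, x) <- A(x).
goal_rules = [(("x1", "x2"), [("R", ("x1", "x2"))]),
              (("x", "x"), [("A", ("x",))])]
print(specialise(goal_rules, ("c1", "c2")))
print(specialise(goal_rules, ("c1", "c1")))
```

Since $\mathbf{c}$ ranges over $C^{n}$ with $|C|=n$ , there are $n^{n}$ such tuples, which is the source of the bound $m\leq n^{n}$ in Lemma 17.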

Note that the non-goal rules in $\Pi_{\mathbf{c}}$ are identical to those in $\Pi$ . By converting proof trees for $\Pi$ into proof trees for $\Pi_{\mathbf{c}}$ and vice versa, one can show the following.

Claim. For all $\mathbf{S}_{E}$ -instances $I$ and $\mathbf{a}\subseteq\mathsf{dom}(I)^{n}$ with $\mathbf{a}\approx\mathbf{c}$ , $I\models\Pi(\mathbf{a})\mbox{ iff }I[\mathbf{c}/\mathbf{a}]\models\Pi_{\mathbf{c}}$ .

We show that $\Pi$ is rewritable into an $(\ell,k)$ -DLog program with $n$ parameters iff all of the constructed programs $\Pi_{\mathbf{c}}$ are rewritable into an $(\ell,k)$ -DLog program with constants.

Let $\Gamma$ be an $(\ell,k)$ -DLog program with $n$ parameters that is a rewriting of $\Pi$ . For each $\mathbf{c}\in C^{n}$ , let $\Gamma_{\mathbf{c}}$ be the Boolean $(\ell,k)$ -DLog program with constants obtained from $\Gamma$ as follows:

•

replace every rule $P(\mathbf{x}\,|\,\mathbf{y})\leftarrow q(\mathbf{z}\,|\,\mathbf{y})$ with $\mathbf{c}\preceq\mathbf{y}$ (and where $P$ might be ${\mathtt{goal}}$ ) by $P(\mathbf{x}_{\mathbf{c}})\leftarrow q(\mathbf{z}_{\mathbf{c}})$ , where $\mathbf{v}_{\mathbf{c}}$ is the result of replacing in $\mathbf{v}$ each variable $y_{i}$ with $c_{i}$ ;

•

drop every rule $P(\mathbf{x}\,|\,\mathbf{y})\leftarrow q(\mathbf{z}\,|\,\mathbf{y})$ with $\mathbf{c}\not\preceq\mathbf{y}$ .

By translating proof trees, it can be shown that ( $*$ ) $I\models\Gamma(\mathbf{c})$ iff $I\models\Gamma_{\mathbf{c}}$ . It is now easy to show that $\Gamma_{\mathbf{c}}$ is a rewriting of $\Pi_{\mathbf{c}}$ : for every $\mathbf{S}_{E}$ -instance $I$ , $I\models\Pi_{\mathbf{c}}$ iff $I\models\Pi(\mathbf{c})$ (by the claim) iff $I\models\Gamma(\mathbf{c})$ (since $\Gamma$ is a rewriting of $\Pi$ ) iff $I\models\Gamma_{\mathbf{c}}$ (by ( $*$ )).

Conversely, for all $\mathbf{c}\in C^{n}$ let $\Gamma_{\mathbf{c}}$ be a Boolean $(\ell,k)$ -DLog program with constants that is a rewriting of $\Pi_{\mathbf{c}}$ . We construct an $(\ell,k)$ -DLog program with $n$ parameters $\Gamma$ as follows. For each $\mathbf{c}\in C^{n}$ , fix a tuple $\mathbf{v}$ of fresh variables such that $\mathbf{v}\approx\mathbf{c}$ . Let $\Gamma^{v}_{\mathbf{c}}$ be the $(\ell,k)$ -DLog program with $n$ parameters obtained from $\Gamma_{\mathbf{c}}$ as follows:

(i)

replace each $c_{i}$ with $v_{i}$ ;

(ii)

replace each non-goal IDB atom $P(\mathbf{x})$ with the atom $P^{\mathbf{c}}(\mathbf{x}\,|\,\mathbf{v})$ (both in rule bodies and heads), $P^{\mathbf{c}}$ a fresh IDB relation;

(iii)

replace ${\mathtt{goal}}()$ with ${\mathtt{goal}}(\mathbf{v})$ .

Then $\Gamma$ is defined as the union of all programs $\Gamma^{v}_{\mathbf{c}}$ . We first argue that for every $\mathbf{c}\in C^{n}$ , $\mathbf{S}_{E}$ -instance $I$ , and $\mathbf{a}\subseteq\mathsf{dom}(I)^{n}$ with $\mathbf{a}\approx\mathbf{c}$ ,

(1)

$I\models\Gamma_{\mathbf{c}}$ implies $I[\mathbf{a}/\mathbf{c}]\models\Gamma^{v}_{\mathbf{c}}(\mathbf{a})$ and

(2)

$I\models\Gamma^{v}_{\mathbf{c^{\prime}}}(\mathbf{a})$ , with $\mathbf{a}\preceq\mathbf{c}^{\prime}$ , implies $I[\mathbf{c}/\mathbf{a}]\models\Gamma_{\mathbf{c}}$ .

Point 1 can be proved by showing that, from a proof tree of ${\mathtt{goal}}()$ from $I$ and $\Gamma_{\mathbf{c}}$ , one can construct a proof tree of ${\mathtt{goal}}(\mathbf{a})$ from $I[\mathbf{a}/\mathbf{c}]$ and $\Gamma^{v}_{\mathbf{c}}$ . For Point 2, assume $I\models\Gamma^{v}_{\mathbf{c^{\prime}}}(\mathbf{a})$ with $\mathbf{a}\preceq\mathbf{c}^{\prime}$ . Then $I[\mathbf{c}/\mathbf{a}]\models(\Gamma_{\mathbf{c^{\prime}}})[\mathbf{c}/\mathbf{c^{\prime}}]$ can again be shown by manipulating proof trees. It can be verified that, by construction, $(\Pi_{\mathbf{c^{\prime}}})[\mathbf{c}/\mathbf{c^{\prime}}]\subseteq\Pi_{\mathbf{c}}$ . Consequently and since $\Gamma_{\mathbf{c}^{\prime}}$ is a rewriting of $\Pi_{\mathbf{c}^{\prime}}$ , $J\models(\Gamma_{\mathbf{c^{\prime}}})[\mathbf{c}/\mathbf{c^{\prime}}]$ implies $J\models\Gamma_{\mathbf{c}}$ for all $J$ , that is, $(\Gamma_{\mathbf{c^{\prime}}})[\mathbf{c}/\mathbf{c^{\prime}}]$ is contained in $\Gamma_{\mathbf{c}}$ in the sense of query containment. Thus in particular $I[\mathbf{c}/\mathbf{a}]\models\Gamma_{\mathbf{c}}$ , as required.

It remains to show that $\Gamma$ is a rewriting for $\Pi$ . First assume that $I\models\Pi(\mathbf{a})$ . Choose some $\mathbf{c}\in C^{n}$ with $\mathbf{a}\approx\mathbf{c}$ . Then $I[\mathbf{c}/\mathbf{a}]\models\Pi_{\mathbf{c}}$ by the claim and thus $I[\mathbf{c}/\mathbf{a}]\models\Gamma_{\mathbf{c}}$ since $\Gamma_{\mathbf{c}}$ is a rewriting of $\Pi_{\mathbf{c}}$ . Point 1 above yields $I[\mathbf{c}/\mathbf{a}][\mathbf{a}/\mathbf{c}]=I\models\Gamma(\mathbf{a})$ .

Now assume that $I\models\Gamma(\mathbf{a})$ . Then by construction of $\Gamma$ , there is a $\mathbf{c}^{\prime}\in C^{n}$ such that $\mathbf{a}\preceq\mathbf{c}^{\prime}$ and $I\models\Gamma^{v}_{\mathbf{c^{\prime}}}(\mathbf{a})$ . To see this, note in particular that the different programs $\Gamma^{v}_{\mathbf{c}}$ do not share any IDBs and thus do not interact in $\Gamma$ . Choose a $\mathbf{c}\in C^{n}$ with $\mathbf{a}\approx\mathbf{c}$ . From Point 2 above, we obtain $I[\mathbf{c}/\mathbf{a}]\models\Gamma_{\mathbf{c}}$ which yields $I[\mathbf{c}/\mathbf{a}]\models\Pi_{\mathbf{c}}$ . This implies $I\models\Pi(\mathbf{a})$ by the claim.

We next show that constants can be eliminated from Boolean programs.

Lemma 18.

Given a Boolean MDDLog program $\Pi_{c}$ with constants over EDB schema $\mathbf{S}_{E}$ , one can construct a Boolean MDDLog program $\Pi$ over an EDB schema $\mathbf{S}_{E}^{\prime}$ such that

(1)

$\Pi_{c}$ is rewritable into $\ell$ -DLog with constants iff $\Pi$ is rewritable into $\ell$ -DLog, for any $\ell$ ;

(2)

If $\Pi_{c}$ is of size $n$ and diameter $k$ , then the size of $\Pi$ is $2^{p(k\cdot{\mathtt{log}}n)}$ ; moreover, the diameter of $\Pi$ is bounded by the rule size of $\Pi_{c}$ .

The construction takes time polynomial in the size of $\Pi$ .

Proof 7.2.

Let $\Pi_{c}$ be a Boolean MDDLog program over EDB schema $\mathbf{S}_{E}$ that contains constants $c_{1},\ldots,c_{n}$ . The program $\Pi$ will be over EDB schema $\mathbf{S}^{\prime}_{E}=\mathbf{S}_{E}\cup\{R_{1},\ldots,R_{n}\}$ where $R_{1},\dots,R_{n}$ are fresh monadic relation symbols. $\Pi$ contains all rules that can be obtained from a rule $\rho$ in $\Pi_{c}$ by choosing a partial function $\delta$ that maps terms (variables or constants) in $\rho$ to elements of $\{1,\dots,n\}$ such that $\delta(c_{i})=i$ for each constant $c_{i}$ and then, for each term $t$ with $\delta(t)=i$ ,

(1) replacing each occurrence of $t$ in the body of $\rho$ with a fresh variable $x$ and adding $R_{i}(x)$ , and

(2) replacing each occurrence of $t$ in the head of $\rho$ with one of the fresh variables introduced for $t$ in Step 1.

Additionally, $\Pi$ contains the rule ${\mathtt{goal}}()\leftarrow R_{i}(x),R_{j}(x)$ , for $1\leq i<j\leq n$ .
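The two replacement steps can be illustrated by a minimal Python sketch. The encoding of atoms as `(predicate, args)` pairs, the function names `dejoin_rule` and `clash_rules`, and the fresh-variable naming scheme `_x1, _x2, ...` are assumptions made for this illustration only. The sketch handles one fixed choice of $\delta$ and always uses the first fresh variable for head occurrences, whereas the construction ranges over all choices.

```python
from itertools import combinations, count

# An atom is (predicate, args); a rule is a pair (head_atoms, body_atoms).
# This encoding and all names below are illustrative assumptions.

def dejoin_rule(head, body, delta):
    """Apply Steps 1 and 2 for one fixed choice of the partial function delta.

    delta maps terms (variables or constants) to indices in {1..n}; the caller
    must map every constant c_i to i. Simplification: head occurrences always
    use the FIRST fresh variable introduced for the term. Assumes every term
    in delta that occurs in the head also occurs in the body.
    """
    fresh = count(1)
    introduced = {}  # term -> fresh variables that replaced it in the body
    new_body = []
    for pred, args in body:
        new_args = []
        for t in args:
            if t in delta:
                x = f"_x{next(fresh)}"  # one fresh variable per occurrence
                introduced.setdefault(t, []).append(x)
                new_args.append(x)
            else:
                new_args.append(t)
        new_body.append((pred, tuple(new_args)))
    # Step 1, second half: add the marker atom R_i(x) for each fresh variable.
    for t, xs in introduced.items():
        new_body.extend((f"R{delta[t]}", (x,)) for x in xs)
    # Step 2: replace head occurrences by a fresh variable introduced for t.
    new_head = [
        (pred, tuple(introduced[t][0] if t in delta else t for t in args))
        for pred, args in head
    ]
    return new_head, new_body


def clash_rules(n):
    """The additional rules goal() <- R_i(x), R_j(x) for 1 <= i < j <= n."""
    return [
        ([("goal", ())], [(f"R{i}", ("x",)), (f"R{j}", ("x",))])
        for i, j in combinations(range(1, n + 1), 2)
    ]


# Example rule P(y) <- E(c1, y), E(y, c1) with delta(c1) = 1.
head, body = dejoin_rule(
    [("P", ("y",))],
    [("E", ("c1", "y")), ("E", ("y", "c1"))],
    {"c1": 1},
)
```

In the example, the two occurrences of `c1` receive distinct fresh variables, each guarded by an `R1` atom, so the join on the constant is replaced by membership in the marker relation.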

Note that the rewriting presented above, which we call dejoining since it introduces different variables for each occurrence of a term $t$ in a rule body, can be applied not only to MDDLog programs, but also to MDLog programs. Before we proceed, we make a basic observation about dejoining and its connection to a certain quotient construction. Let $\Pi$ be an MDDLog program or an MDLog program, with constants $c_{1},\dots,c_{n}$ , and let $\Pi_{d}$ be the result of dejoining $\Pi$ . Let $I$ be an $\mathbf{S}^{\prime}_{E}$ -instance such that $R_{i},R_{j}$ are disjoint whenever $i\neq j$ and which does not contain the constants $c_{1},\dots,c_{n}$ . The quotient of $I$ is the $\mathbf{S}_{E}$ -instance $I^{\prime}$ obtained from $I$ by replacing every $d\in\mathsf{dom}(I)$ with $R_{i}(d)\in I$ by the constant $c_{i}$ (which also results in the identification of elements in the active domain) and removing all atoms involving one of the $R_{i}$ relations. By converting proof trees of ${\mathtt{goal}}()$ from $\Pi_{d}$ into proof trees of ${\mathtt{goal}}()$ from $\Pi$ and vice versa, one can show the following.

Claim. $I\models\Pi_{d}$ iff $I^{\prime}\models\Pi$ .

We now show that $\Pi_{c}$ is rewritable into $\ell$ -DLog iff $\Pi$ is.

First let $\Gamma_{c}$ be an $\ell$ -DLog rewriting of $\Pi_{c}$ . Let $\Gamma$ be obtained from $\Gamma_{c}$ by dejoining all rules and adding the rule ${\mathtt{goal}}()\leftarrow R_{i}(x),R_{j}(x)$ for $1\leq i<j\leq n$ . Clearly, $\Gamma$ is an $\ell$ -DLog program. We argue that $\Gamma$ is a rewriting of $\Pi$ . Let $I$ be an $\mathbf{S}^{\prime}_{E}$ -instance. W.l.o.g., we can assume that $I$ does not contain $c_{1},\dots,c_{n}$ . If $R_{i},R_{j}$ are not disjoint for some $i\neq j$ , then $I\models\Pi$ and $I\models\Gamma$ . Otherwise, let $I^{\prime}$ be the quotient of $I$ . We have $I\models\Pi$ iff $I^{\prime}\models\Pi_{c}$ (by the claim) iff $I^{\prime}\models\Gamma_{c}$ ( $\Gamma_{c}$ is a rewriting of $\Pi_{c}$ ) iff $I\models\Gamma$ (again by the claim).

Conversely, let $\Gamma$ be an $\ell$ -DLog rewriting of $\Pi$ . Let $\Gamma_{c}$ be the program constructed from $\Gamma$ by removing all rules that contain atoms of the form $R_{i}(x)$ and $R_{j}(x)$ with $i\neq j$ , replacing all variables $x$ that occur in a rule body in atoms of the form $R_{i}(x)$ with $c_{i}$ , and removing all $R_{i}$ -atoms from such rules. Clearly, $\Gamma_{c}$ is an $\ell$ -DLog program (with constants $c_{1},\dots,c_{n}$ ). We argue that $\Gamma_{c}$ is a rewriting of $\Pi_{c}$ . Let $I$ be an $\mathbf{S}_{E}$ -instance that w.l.o.g. does not contain $c_{1},\dots,c_{n}$ and let $I^{\prime}=I\cup\{R_{1}(c_{1}),\ldots,R_{n}(c_{n})\}$ . Note that $I$ is the quotient of $I^{\prime}$ . Then $I\models\Pi_{c}$ iff $I^{\prime}\models\Pi$ (by the claim) iff $I^{\prime}\models\Gamma$ ( $\Gamma$ is a rewriting of $\Pi$ ) iff $I\models\Gamma_{c}$ (by construction of $\Gamma_{c}$ ).
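The quotient construction used in the proof above can be made concrete with a small sketch. Instances are encoded as sets of facts `(predicate, args)`, and the marker relations and constants are assumed to be named `R1, ..., Rn` and `c1, ..., cn`; both conventions, as well as the function name `quotient`, are choices made only for this illustration.

```python
def quotient(instance, n):
    """Quotient of an instance given as a set of facts (pred, args).

    Every element d with R_i(d) in the instance is replaced by the constant
    c_i, and all R_i-facts are dropped. A ValueError is raised if an element
    carries two different markers, since the quotient is only defined when
    the marker relations are pairwise disjoint.
    """
    marker = {f"R{i}": f"c{i}" for i in range(1, n + 1)}
    rename = {}
    for pred, args in instance:
        if pred in marker:
            d = args[0]
            if rename.setdefault(d, marker[pred]) != marker[pred]:
                raise ValueError(f"{d} carries two different markers")
    return {
        (pred, tuple(rename.get(a, a) for a in args))
        for pred, args in instance
        if pred not in marker
    }


# Both a and d are marked with R1, so they are identified as c1.
I = {("E", ("a", "b")), ("E", ("b", "d")), ("R1", ("a",)), ("R1", ("d",))}
I_quot = quotient(I, 1)
```

The example shows the identification of elements mentioned in the definition: `a` and `d` collapse into the single constant `c1`.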

We are now ready to lift the complexity results from Theorems 6 and 13 to the non-Boolean case, by putting them together with Lemmas 17 and 18.

Theorem 19.

For $n$ -ary MDDLog programs,

(1) FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime-complete;

(2) rewritability into MDLog with $n$ parameters is in 3ExpTime (and 2NExpTime-hard);

(3) DLog-rewritability is 2NExpTime-complete for programs that have equality.

Proof 7.3.

Recall that the upper bounds for rewritability of Boolean MDDLog programs stated in Theorems 6 and 13 are obtained by constructing a (generalized) CSP and then deciding the rewritability of (the complement of) that CSP. One can trace the blowups stated in Lemmas 17 and 18 as well as in Theorems 1 and 1 to verify that the constructed CSP does not become significantly larger in the non-Boolean case, that is, it still satisfies the bounds stated in Point 3 of Proposition 1. Thus, we obtain the same upper bounds as in the Boolean case. Regarding Point 1, we additionally recall that Proposition 1 also covers non-Boolean MDDLog programs and thus it suffices to consider UCQ-rewritability. Regarding Point 3, we note that it can be verified that the constructions in the proofs of Lemmas 17 and 18 preserve the property of having equality.

In view of Point 2, we remark (once more) that for non-Boolean MDDLog programs $\Pi$ , MDLog with parameters is in a sense a more natural target for rewriting than MDLog without parameters. The intuitive reason is that positions in the answer to $\Pi$ can be thought of as constants, and constants correspond to parameters. To make this a bit more precise, consider the grounding $\Pi^{\prime}$ of $\Pi$ obtained by replacing, in every goal rule, each variable that occurs in the head by a constant. In contrast to the standard database setup (and in contrast to the proof of Lemma 17), we mean here constants that are interpreted according to the standard FO semantics, that is, different constants can denote the same element of an instance. When looking for an MDLog-rewriting of $\Pi^{\prime}$ , it is clearly very natural to admit the constants from $\Pi^{\prime}$ also in the rewriting. Now, one can verify that any such rewriting can be translated in a straightforward way into a rewriting of $\Pi$ into MDLog with parameters, and vice versa.

We further note that MDLog with parameters enjoys similarly nice properties as standard MDLog. For example, containment is decidable. This follows from [RK13, BKR15] where generalizations of MDLog with parameters are studied, the actual parameters being represented by constants.

We also remark that Theorem 19 remains true when we admit constants in MDDLog programs. In fact, the proof of Lemma 17 goes through also when the original MDDLog program contains constants, and both the original and the newly introduced constants can then be removed by Lemma 18.

7.2. Canonical Datalog-Rewritings

We now turn our attention to canonical DLog-rewritings for non-Boolean MDDLog programs. Let $\Pi$ be an $n$ -ary MDDLog program. We associate with $\Pi$ a canonical $(\ell,k)$ -DLog program with $n$ parameters, for any $\ell<k$ . The construction is a refinement of the one from the Boolean case.

We start with some preliminaries. An $n$ -marked instance is an instance $I$ endowed with $n$ (not necessarily distinct) distinguished elements $\mathbf{c}=c_{1},\dots,c_{n}$ . An $(\ell,k)$ -tree decomposition with $n$ parameters of an $n$ -marked instance $(I,\mathbf{c})$ is an $(\ell+m,k+m)$ -tree decomposition of $I$ , $m$ the number of distinct constants in $\mathbf{c}$ , in which every bag $B_{v}$ contains all constants from $\mathbf{c}$ . An $n$ -marked instance has treewidth $(\ell,k)$ with $n$ parameters if it admits an $(\ell,k)$ -tree decomposition with $n$ parameters.

We first convert $\Pi$ into a DDLog program $\Pi^{\prime}$ that is equivalent to $\Pi$ on instances of bounded treewidth. The construction is identical to the Boolean case (first variable identification, then rewriting) except that

(1) we use treewidth $(\ell+n,k+n)$ in place of treewidth $(\ell,k)$ ; consequently, the arity of the freshly introduced IDB relations may also be up to $\ell+n$ ;

(2) for ${\mathtt{goal}}$ rules, all head variables must occur in the root bag of the tree decomposition (they can then be treated in the same way as a Boolean ${\mathtt{goal}}$ rule despite the $n$ -ary head relation).

It can be verified that $\Pi^{\prime}$ is sound for $\Pi$ and that it is complete for $\Pi$ on $n$ -marked instances of treewidth $(\ell,k)$ with $n$ parameters in the sense that, for all such instances $(I,\mathbf{c})$ , $I\models\Pi[\mathbf{c}]$ implies $I\models\Pi^{\prime}[\mathbf{c}]$ . $\Pi^{\prime}$ is not guaranteed to be complete for answers other than $\mathbf{c}$ because of the way we treat goal rules in Point 2 above, for example when $\Pi$ contains a rule of the form ${\mathtt{goal}}(x,y)\leftarrow A(x)\wedge B(y)$ .

Let $\mathbf{S}^{\prime}_{I}$ denote the additional IDB relations in the resulting program $\Pi^{\prime}$ . We now construct the canonical $(\ell,k)$ -DLog program with $n$ parameters $\Gamma^{c}$ . Fix constants $a_{1},\dots,a_{\ell},$ $b_{1},\dots,b_{n}$ and, for each $\ell^{\prime}\leq\ell$ , let $\mathfrak{I}_{\ell^{\prime},n}$ denote the set of all $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ -instances with domain $\mathbf{a}_{\ell^{\prime},n}:=a_{1},\dots,a_{\ell^{\prime}},$ $b_{1},\dots,b_{n}$ . The program uses $(\ell^{\prime}+n)$ -ary IDB relations $P_{M}$ , for all $\ell^{\prime}\leq\ell$ and all $M\subseteq\mathfrak{I}_{\ell^{\prime},n}$ . It contains all rules $q(\mathbf{x})\rightarrow P_{M}(\mathbf{y}\,|\,\mathbf{x}_{p})$ , $M\subseteq\mathfrak{I}_{\ell^{\prime},n}$ , that satisfy the following conditions:

(1) $q(\mathbf{x})$ contains at most $k+n$ variables;

(2) in every extension $J$ of the $\mathbf{S}_{E}$ -instance $I_{q}|_{\mathbf{S}_{E}}$ with $\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}$ -facts such that

(a) $J$ satisfies all rules of $\Pi^{\prime}$ and does not contain ${\mathtt{goal}}(\mathbf{x}_{p})$ and

(b) for each $P_{N}(\mathbf{z}\,|\,\mathbf{x}_{p})\in q$ , $N\subseteq\mathfrak{I}_{\ell^{\prime\prime},n}$ , there is an $L\in N$ such that $L[\mathbf{z}\mathbf{x}_{p}/\mathbf{a}_{\ell^{\prime\prime},n}]=J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}},\mathbf{z}$ ,

there is an $L\in M$ such that $L[\mathbf{y}\mathbf{x}_{p}/\mathbf{a}_{\ell^{\prime},n}]=J|_{\mathbf{S}_{I}\cup\mathbf{S}^{\prime}_{I}},\mathbf{y}$ .

We also include all rules of the form $P_{\emptyset}(\mathbf{y}\,|\,\mathbf{x}_{p})\rightarrow{\mathtt{goal}}(\mathbf{x}_{p})$ . This finishes the construction of $\Gamma^{c}$ . It is straightforward to verify that $\Gamma^{c}$ is sound for $\Pi$ . It is complete in the same sense as $\Pi^{\prime}$ .

Lemma 20.

$\Gamma^{c}$ is sound for $\Pi$ . It is complete for $\Pi$ on $n$ -marked instances of treewidth $(\ell,k)$ with $n$ parameters in the sense that for any such instance $(I,\mathbf{c})$ , $I\models\Pi(\mathbf{c})$ implies $I\models\Gamma^{c}(\mathbf{c})$ .

The proof of Lemma 20 is similar to that of Lemma 15; details are omitted. In analogy with Theorem 16, we can then obtain the following result about canonical DLog programs.

Theorem 21.

Let $\Pi$ be an $n$ -ary MDDLog program, $0<\ell\leq k$ , and $\Gamma^{c}$ the canonical $(\ell,k)$ -DLog program with $n$ parameters associated with $\Pi$ . Then

(1) $\Gamma\subseteq\Gamma^{c}$ for every $(\ell,k)$ -DLog program $\Gamma$ that is sound for $\Pi$ ;

(2) $\Pi$ is rewritable into $(\ell,k)$ -DLog with $n$ parameters iff $\Gamma^{c}$ is a rewriting of $\Pi$ .

Note that, as a consequence of Theorem 21, an $n$ -ary MDDLog program $\Pi$ is DLog-rewritable (in the standard sense, without parameters) iff the canonical $(\ell,k)$ -DLog program with $n$ parameters is a rewriting, for some $\ell,k$ . In a sense, this exactly parallels the behaviour of canonical DLog programs in the Boolean case. As an important consequence, the reductions presented in Lemmas 17 and 18 show that if DLog-rewritability of Boolean programs turns out to be decidable (without assuming equality), then the same is true for DLog-rewritability of non-Boolean programs.

7.3. Shape of Rewritings and Obstructions

We now analyze the shape of rewritings of non-Boolean MDDLog programs. An $(\ell,k)$ -tree decomposition with $n$ parameters of an $n$ -ary CQ $q$ is an $(\ell+n,k+n)$ -tree decomposition of $q$ in which every bag $B_{v}$ contains all answer variables of $q$ . The treewidth with $n$ parameters of an $n$ -ary CQ is now defined in the expected way.

Theorem 22.

Let $\Pi$ be an $n$ -ary MDDLog program of diameter $k$ . Then

(1) if $\Pi$ is FO-rewritable, then it has a UCQ-rewriting in which each CQ has treewidth $(1,k)$ with $n$ parameters;

(2) if $\Pi$ is rewritable into MDLog with $n$ parameters, then it has an MDLog-rewriting with $n$ parameters of diameter $k$ .

Note that Theorem 22 is immediate from Theorem 7 and Lemmas 17 and 18 when $k$ denotes the rule size of $\Pi$ instead of its diameter. To get the improved version, one needs to carefully trace the construction of rewritings, starting with rewritings for the CSPs ultimately constructed and then through the proofs of Lemmas 2, 18, and 17. In particular, the constructions in Lemmas 2 and 18 interplay in a subtle way that can be exploited to improve the bound. Details are given in Appendix C.

As in the Boolean case, rewritings are closely related to obstructions. We define obstruction sets for MMSNP formulas with free variables and summarize the results that we obtain for them. A set of marked obstructions $\mathcal{O}$ for an MMSNP formula $\theta$ with $n$ free variables over schema $\mathbf{S}_{E}$ is a set of $n$ -marked instances over the same schema such that for any $\mathbf{S}_{E}$ -instance $I$ , we have $I\not\models\theta[\mathbf{a}]$ iff for some $(O,\mathbf{c})\in\mathcal{O}$ , there is a homomorphism $h$ from $O$ to $I$ with $h(\mathbf{c})=\mathbf{a}$ . We obtain the following corollary from Point 1 of Theorem 22 in exactly the same way in which Corollary 8 is obtained from Point 1 of Theorem 7.

Corollary 23.

For every MMSNP formula $\theta$ with $n$ free variables, the following are equivalent:

(1) $\theta$ is FO-rewritable;

(2) $\theta$ has a finite marked obstruction set;

(3) $\theta$ has a finite set of finite marked obstructions of treewidth $(1,k)$ with $n$ parameters.
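The defining condition of a marked obstruction, namely a homomorphism $h$ from $O$ to $I$ with $h(\mathbf{c})=\mathbf{a}$ , can be checked by brute force on small inputs. The following sketch assumes instances encoded as sets of facts `(predicate, args)`; the function name and the example instances are hypothetical and serve only to illustrate the definition.

```python
from itertools import product

def has_marked_homomorphism(obstruction, marks, instance, answer):
    """Is there a homomorphism h from `obstruction` to `instance` that maps
    the tuple of marked elements `marks` to the tuple `answer`?

    Brute force over all assignments; only meant for small examples.
    """
    dom_o = {a for _, args in obstruction for a in args} | set(marks)
    dom_i = list({a for _, args in instance for a in args})
    # The marked elements are forced; repeated marks must agree on the image.
    fixed = {}
    for c, a in zip(marks, answer):
        if fixed.setdefault(c, a) != a:
            return False
    free = [d for d in dom_o if d not in fixed]
    for image in product(dom_i, repeat=len(free)):
        h = {**fixed, **dict(zip(free, image))}
        if all((p, tuple(h[x] for x in args)) in instance
               for p, args in obstruction):
            return True
    return False


# A marked R-cycle of length 2 and a small instance containing the cycle a <-> b.
two_cycle = {("R", ("o0", "o1")), ("R", ("o1", "o0"))}
inst = {("R", ("a", "b")), ("R", ("b", "a")), ("R", ("b", "c"))}
on_cycle = has_marked_homomorphism(two_cycle, ("o0",), inst, ("a",))
off_cycle = has_marked_homomorphism(two_cycle, ("o0",), inst, ("c",))
```

Here the marked element can be mapped to `a`, which lies on a cycle, but not to `c`, which has no outgoing edge.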

It is interesting to note that this result can be viewed as a generalization of the characterization of obstruction sets for CSP templates with constants in terms of ‘c-acyclicity’ in [AtCKT11]; our parameters correspond to constants in that paper. We now turn to MDLog-rewritability.

Proposition 24.

Let $\theta$ be an MMSNP formula of diameter $k$ with $n$ free variables. Then $\neg\theta$ is rewritable into an MDLog program with $n$ parameters iff $\theta$ has a set of marked obstructions (equivalently: finite marked obstructions) that are of treewidth $(1,k)$ with $n$ parameters.

Proof 7.4.

The “only if” direction is a consequence of Point 2 of Theorem 22 and the fact that, for any MDLog program $\Pi\equiv\neg\theta$ with $n$ parameters of diameter $k$ over EDB schema $\mathbf{S}_{E}$ , a proof tree for ${\mathtt{goal}}(\mathbf{c})$ from an $\mathbf{S}_{E}$ -instance $I$ and $\Pi$ gives rise to a finite $n$ -marked $\mathbf{S}_{E}$ -instance $(J,\mathbf{c})$ of treewidth $(1,k)$ with $n$ parameters that satisfies $J\rightarrow I$ . The “if” direction is a consequence of the fact that the canonical $(1,k)$ -DLog program with $n$ parameters associated with $\neg\theta$ viewed as an MDDLog program is complete on inputs of treewidth $(1,k)$ with $n$ parameters in the sense of Lemma 20.

As an illustration, it might be interesting to reconsider Example 7.1. The unary MDDLog program shown there is the negation of a unary MMSNP formula that has as a set of marked obstructions the set of all $R$ -cycles on which one element is the marked element. Each of these obstructions has treewidth $(1,2)$ with one parameter, but not treewidth $(1,2)$ in the strict sense.
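The treewidth claim for marked cycles can be checked on a concrete instance. The sketch below assumes the reading of an $(\ell,k)$ -tree decomposition in which every bag has at most $k$ elements and adjacent bags share at most $\ell$ elements, combined with the definition with parameters from Section 7.2. The encoding of bags and edges and the function name are illustrative assumptions, and the tree shape of `edges` is trusted rather than verified.

```python
def is_tree_decomposition_with_parameters(facts, marked, bags, edges, l, k):
    """Check the conditions of an (l,k)-tree decomposition with parameters.

    facts: set of (pred, args); marked: tuple of distinguished elements;
    bags: dict node -> set of elements; edges: list of node pairs.
    """
    m = len(set(marked))
    # Every bag contains all marked elements and has at most k + m elements.
    for bag in bags.values():
        if not set(marked) <= bag or len(bag) > k + m:
            return False
    # Adjacent bags overlap in at most l + m elements.
    if any(len(bags[v] & bags[w]) > l + m for v, w in edges):
        return False
    # Every fact is covered by some bag.
    if not all(any(set(args) <= bag for bag in bags.values())
               for _, args in facts):
        return False
    # The nodes whose bags contain a given element must be connected.
    for a in set().union(*bags.values()):
        nodes = {v for v, bag in bags.items() if a in bag}
        start = next(iter(nodes))
        seen, todo = {start}, [start]
        while todo:
            v = todo.pop()
            for x, y in edges:
                for u, w in ((x, y), (y, x)):
                    if u == v and w in nodes and w not in seen:
                        seen.add(w)
                        todo.append(w)
        if seen != nodes:
            return False
    return True


# An R-cycle a0 -> a1 -> ... -> a4 -> a0 with marked element a0.
size = 5
cycle = {("R", (f"a{i}", f"a{(i + 1) % size}")) for i in range(size)}
# Path-shaped decomposition: bag i holds the marked element and one edge.
bags = {i: {"a0", f"a{i}", f"a{i + 1}"} for i in range(1, size - 1)}
edges = [(i, i + 1) for i in range(1, size - 2)]

with_parameter = is_tree_decomposition_with_parameters(
    cycle, ("a0",), bags, edges, 1, 2)
without_parameter = is_tree_decomposition_with_parameters(
    cycle, (), bags, edges, 1, 2)
```

The decomposition is accepted once `a0` is counted as a parameter. The second call only shows that this particular decomposition exceeds the bag size without the parameter; it does not by itself show that no other decomposition exists.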

8. Ontology-Mediated Queries

While the results on disjunctive Datalog and on MMSNP obtained in the previous sections are interesting in their own right, our primary aim is to study the fundamental question of rewritability in the context of ontology-mediated queries (OMQs). Such questions have received a lot of interest in the OMQ context, see for example [BtCLW14, BHLW16, LS17] and references therein. In particular, we settle an open question from [BtCLW14] by showing that in the OMQ language $(\mathcal{ALCI},\text{CQ})$ , introduced in detail below, FO-rewritability is decidable and 2NExpTime-complete. In what follows, we first introduce several prominent description logics to serve as ontology languages and, based on that, ontology-mediated queries. We then show how the results from the previous sections can be used to obtain results about ontology-mediated queries.

8.1. Preliminaries

In description logics, ontologies are defined by so-called TBoxes. A TBox, in turn, is a set of inclusions (that is, logical implications) between concepts (that is, logical formulas), and possibly also additional kinds of statements. Each description logic is determined by the constructors that are available to build up concepts and by the statements that are allowed in TBoxes. Here, we introduce the widely known description logics $\mathcal{ALC}$ , $\mathcal{ALCI}$ , and $\mathcal{SHI}$ , listed in the order of increasing expressive power. We refer the reader to [BHLS17] for a more thorough introduction to DLs.

An $\mathcal{ALCI}$ -concept is formed according to the syntax rule

[TABLE]

where $A$ ranges over a fixed countably infinite set of concept names and $r$ over a fixed countably infinite set of role names. An $\mathcal{ALC}$ -concept is an $\mathcal{ALCI}$ -concept in which the constructors $\exists r^{-}.C$ and $\forall r^{-}.C$ are not used. An $\mathcal{ALC}$ -TBox (resp. $\mathcal{ALCI}$ -TBox) is a finite set of concept inclusions $C\sqsubseteq D$ , $C$ and $D$ $\mathcal{ALC}$ -concepts (resp. $\mathcal{ALCI}$ -concepts). While $\mathcal{ALCI}$ extends $\mathcal{ALC}$ with additional concept constructors, $\mathcal{SHI}$ extends $\mathcal{ALCI}$ with additional types of TBox statements. There is thus no need to define $\mathcal{SHI}$ -concepts as these are simply $\mathcal{ALCI}$ -concepts. A role is either a role name or an expression $r^{-}$ with $r$ a role name. A $\mathcal{SHI}$ -TBox is a finite set of

• concept inclusions $C\sqsubseteq D$ , $C$ and $D$ $\mathcal{ALCI}$ -concepts,

• role inclusions $r\sqsubseteq s$ , $r$ and $s$ roles, and

• transitivity statements ${\mathtt{trans}}(r)$ , $r$ a role name.

DL semantics is given in terms of interpretations. An interpretation takes the form $\mathcal{I}=(\Delta^{\mathcal{I}},\cdot^{\mathcal{I}})$ where $\Delta^{\mathcal{I}}$ is a non-empty set called the domain and $\cdot^{\mathcal{I}}$ is the interpretation function which maps each concept name $A$ to a subset $A^{\mathcal{I}}\subseteq\Delta^{\mathcal{I}}$ and each role name $r$ to a binary relation $r^{\mathcal{I}}\subseteq\Delta^{\mathcal{I}}\times\Delta^{\mathcal{I}}$ . Note that an interpretation is simply a notational variant of a relational FO-structure that interprets only unary and binary relations. The interpretation function is extended to compound concepts in the standard way, as given in Figure 2.

An interpretation $\mathcal{I}$ is a model of a TBox $\mathcal{T}$ if it satisfies all statements in $\mathcal{T}$ , that is,

• $C\sqsubseteq D\in\mathcal{T}$ implies $C^{\mathcal{I}}\subseteq D^{\mathcal{I}}$ ;

• $r\sqsubseteq s\in\mathcal{T}$ implies $r^{\mathcal{I}}\subseteq s^{\mathcal{I}}$ ;

• ${\mathtt{trans}}(r)\in\mathcal{T}$ implies that $r^{\mathcal{I}}$ is transitive.

For roles $r,s$ , we write $\mathcal{T}\models r\sqsubseteq s$ if every model $\mathcal{I}$ of $\mathcal{T}$ satisfies $r^{\mathcal{I}}\subseteq s^{\mathcal{I}}$ .
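For finite interpretations, the semantics can be made concrete with a short sketch. It assumes the standard set-theoretic semantics of the $\mathcal{ALCI}$ constructors (which Figure 2 is presumed to follow), encodes concepts as nested tuples, and represents an inverse role $r^{-}$ as `("inv", "r")`. The encoding and the names `role_ext`, `ext`, and `is_model` are hypothetical choices for this illustration.

```python
def role_ext(role, roles):
    """Extension of a role: a role name, or ("inv", name) for its inverse."""
    if isinstance(role, tuple):
        return {(e, d) for d, e in roles.get(role[1], set())}
    return set(roles.get(role, set()))


def ext(c, domain, concepts, roles):
    """Extension of a concept given as a nested tuple, e.g.
    ("exists", "r", ("name", "B")) for the concept 'exists r.B'."""
    op = c[0]
    if op == "name":
        return set(concepts.get(c[1], set()))
    if op == "top":
        return set(domain)
    if op == "bot":
        return set()
    if op == "not":
        return set(domain) - ext(c[1], domain, concepts, roles)
    if op == "and":
        return ext(c[1], domain, concepts, roles) & ext(c[2], domain, concepts, roles)
    if op == "or":
        return ext(c[1], domain, concepts, roles) | ext(c[2], domain, concepts, roles)
    if op in ("exists", "forall"):
        pairs = role_ext(c[1], roles)
        filler = ext(c[2], domain, concepts, roles)
        succ = {d: {e for d2, e in pairs if d2 == d} for d in domain}
        if op == "exists":
            return {d for d in domain if succ[d] & filler}
        return {d for d in domain if succ[d] <= filler}
    raise ValueError(f"unknown constructor {op!r}")


def is_model(tbox, domain, concepts, roles):
    """Check the three kinds of TBox statements on a finite interpretation."""
    for stmt in tbox:
        if stmt[0] == "sub":      # concept inclusion C ⊑ D
            ok = ext(stmt[1], domain, concepts, roles) <= ext(stmt[2], domain, concepts, roles)
        elif stmt[0] == "rsub":   # role inclusion r ⊑ s
            ok = role_ext(stmt[1], roles) <= role_ext(stmt[2], roles)
        elif stmt[0] == "trans":  # transitivity statement trans(r)
            r = role_ext(stmt[1], roles)
            ok = all((d, f) in r for d, e in r for e2, f in r if e == e2)
        else:
            raise ValueError(f"unknown statement {stmt[0]!r}")
        if not ok:
            return False
    return True


# A two-element interpretation with A = {1}, B = {2}, r = {(1, 2)}.
domain = {1, 2}
concepts = {"A": {1}, "B": {2}}
roles = {"r": {(1, 2)}}
A, B = ("name", "A"), ("name", "B")
tbox = [("sub", A, ("exists", "r", B)), ("trans", "r")]
```

Note that such a check only applies to a single given finite interpretation; deciding entailments such as $\mathcal{T}\models r\sqsubseteq s$ requires reasoning over all models and is not covered by this sketch.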

In description logic, data is typically stored in so-called ABoxes. For uniformity with MDDLog, we use instances instead, identifying unary relations with concept names, binary relations with role names, and disallowing relations of any other arity. An interpretation $\mathcal{I}$ is a model of an instance $I$ if $A(a)\in I$ implies $a\in A^{\mathcal{I}}$ and $r(a,b)\in I$ implies $(a,b)\in r^{\mathcal{I}}$ . We say that an instance $I$ is consistent with a TBox $\mathcal{T}$ if $I$ and $\mathcal{T}$ have a joint model.

An ontology-mediated query (OMQ) over a schema $\mathbf{S}_{E}$ is a triple $(\mathcal{T},\mathbf{S}_{E},q)$ where $\mathcal{T}$ is a TBox formulated in a description logic and $q$ is a query. The TBox can introduce symbols that are not in $\mathbf{S}_{E}$ , which allows it to enrich the schema available for formulating the query $q$ . In fact, $q$ can use symbols from $\mathbf{S}_{E}$ , additional symbols from $\mathcal{T}$ , and also completely fresh symbols (which is useful only in very rare cases). As the TBox language, we may use any of the description logics introduced above. Since all these logics admit only unary and binary relations, we assume that these are the only allowed arities in schemas throughout Section 8. As the actual query language, we use UCQs and CQs. The OMQ languages that these choices give rise to are denoted with $(\mathcal{ALC},\text{CQ})$ , $(\mathcal{SHI},\text{UCQ})$ , and so on. In the actual query, we generally disallow the use of role names $r$ such that for some role name $s$ , ${\mathtt{trans}}(s)\in\mathcal{T}$ and $\mathcal{T}\models s\sqsubseteq r$ . In fact, admitting such roles in the query poses serious additional complications, which are outside the scope of this paper; see e.g. [BEL+10, GPT13]. To make the restriction explicit, we add a superscript $\cdot^{-}$ to OMQ languages when the DL used permits transitivity statements in the TBox, such as in $(\mathcal{SHI},\text{UCQ})^{-}$ .

The semantics of an OMQ is given in terms of certain answers. Let $Q=(\mathcal{T},\mathbf{S}_{E},q)$ be an OMQ, $I$ an $\mathbf{S}_{E}$ -instance, and $\mathbf{a}$ a tuple of constants from $I$ . We write $I\models Q(\mathbf{a})$ and call $\mathbf{a}$ a certain answer to $Q$ on $I$ if for all models $\mathcal{I}$ of $I$ and $\mathcal{T}$ , we have $\mathcal{I}\models q(\mathbf{a})$ . The latter denotes satisfaction of $q(\mathbf{a})$ in $\mathcal{I}$ in the usual sense of first-order logic.

Example.

Let $Q=(\mathcal{T},\mathbf{S}_{E},q)$ be the following OMQ, formulated in $(\mathcal{ALC},\text{CQ})$ :

[TABLE]

The TBox $\mathcal{T}$ describes the risk of somebody having medullary thyroid cancer (MTC) in the presence of an abnormal calcitonin test (CTest). While abnormal calcitonin levels are a marker for MTC, there can also be false positives, for example due to smoking (Line 1 of $\mathcal{T}$ ). However, in the presence of a high risk for the genetic syndrome MEN2, high calcitonin levels immediately raise an MTC suspicion (Line $2$ ). Pheochromocytoma patients (PCCPatient) have a high MEN2 risk (Line $3$ ). As MEN2 is caused by a genetic mutation, the risk carries within families (Line $4$ ). On the $\mathbf{S}_{E}$ -instance

[TABLE]

the only certain answer to $Q$ is ${\mathtt{john}}$ .

An OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ is FO-rewritable if there is an FO query $\varphi(x)$ over schema $\mathbf{S}_{E}$ (and possibly involving equality), called an FO-rewriting of $Q$ , such that for all $\mathbf{S}_{E}$ -instances $I$ and $\mathbf{a}\subseteq\mathsf{dom}(I)$ , we have $I\models Q(\mathbf{a})$ iff $I\models\varphi(\mathbf{a})$ . Other notions of rewritability such as UCQ-rewritability and MDLog-rewritability are defined accordingly.

Note that the TBox $\mathcal{T}$ can be inconsistent with the input instance $I$ , that is, there could be no joint model of $\mathcal{T}$ and $I$ . It can thus be a sensible alternative to work with consistent FO-rewritability, considering only $\mathbf{S}_{E}$ -instances $I$ that are consistent w.r.t. $\mathcal{T}$ . This can then be complemented with rewritability of inconsistency for $\mathcal{T}$ , that is, rewritability of the Boolean OMQ $(\mathcal{T},\mathbf{S}_{E},\exists x\,A(x))$ , $A$ a fresh concept name, which is true on an $\mathbf{S}_{E}$ -instance $I$ iff $I$ is inconsistent with $\mathcal{T}$ . It is not hard to prove, though, that consistent $\mathcal{Q}$ -rewritability can be reduced to $\mathcal{Q}$ -rewritability in polynomial time for all OMQ languages considered in this paper and all $\mathcal{Q}\in\{\text{FO},\text{MDLog},\text{DLog}\}$ ; see the corresponding proof for query containment in [BL16]. Moreover, rewritability of consistency was studied in [BtCLW14] and shown to be NExpTime-complete for all OMQ languages considered in this paper.

8.2. Rewritability of OMQs

We now lift the results from earlier sections to OMQs. There is a known equivalence-preserving translation from the relevant OMQ languages to MDDLog, but it involves a double exponential blowup [BtCLW14] that most likely is unavoidable. (It was shown in [BtCLW14] that a single exponential blowup is unavoidable; whether the blowup has to be double exponential is an open problem.) We refine this translation and carefully trace the parameters in which the blowup occurs to show that, despite these blowups, the complexity of the relevant problems does not increase. The following is our main result concerning OMQs.

Theorem 25.

In all OMQ languages between $(\mathcal{ALC},\text{UCQ})$ and $(\mathcal{SHI},\text{UCQ})^{-}$ , as well as between $(\mathcal{ALCI},\text{CQ})$ and $(\mathcal{SHI},\text{CQ})^{-}$ ,

(1) FO-rewritability (equivalently: UCQ-rewritability) is 2NExpTime-complete; in fact, there is an algorithm which, given an OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ , decides in time $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ whether $Q$ is FO-rewritable;

(2) MDLog-rewritability is in 3ExpTime (and 2NExpTime-hard); in fact, there is an algorithm which, given an OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ , decides in time $2^{2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}}$ whether $Q$ is MDLog-rewritable,

where $n_{q}$ and $n_{\mathcal{T}}$ are the size of $q$ and $\mathcal{T}$ and $p$ is a polynomial.

Note that the runtime for deciding FO-rewritability stated in Theorem 25 is double exponential only in the size of the actual query $q$ (which tends to be very small), while it is only single exponential in the size of the TBox (which can become large). The same holds for MDLog-rewritability, with one additional exponential.
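To see how the two parameters enter the bound, it may help to unfold the expression for one illustrative choice of the polynomial. The choice $p(x)=x$ and the use of the binary logarithm are assumptions made only for this worked example; the theorem itself leaves $p$ unspecified.

```latex
2^{2^{p(n_q \cdot \log n_{\mathcal{T}})}}
  \;=\; 2^{2^{\,n_q \cdot \log n_{\mathcal{T}}}}
  \;=\; 2^{\left(2^{\log n_{\mathcal{T}}}\right)^{n_q}}
  \;=\; 2^{\,n_{\mathcal{T}}^{\,n_q}}
```

Under this assumption, for a fixed query size $n_{q}$ the exponent $n_{\mathcal{T}}^{\,n_{q}}$ is a polynomial in $n_{\mathcal{T}}$ , so the bound is single exponential in the TBox size, while $n_{q}$ itself sits in the second level of the exponent.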

The lower bounds in Theorem 25 are from [BL16]. To prove the upper bounds, we first give a refined translation from OMQs to MDDLog. A proof is provided in Appendix D.

Theorem 26.

For every OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ from $(\mathcal{SHI},\text{UCQ})^{-}$ , one can construct an equivalent MDDLog program $\Pi$ such that

(1) the size of $\Pi$ is bounded by $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ ;

(2) the IDB schema of $\Pi$ is of size at most $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$ ;

(3) the rule size of $\Pi$ is bounded by $n_{q}$ ,

where $n_{q}$ and $n_{\mathcal{T}}$ are the size of $q$ and $\mathcal{T}$ and $p$ is a polynomial. The construction takes time polynomial in the size of $\Pi$ .

Let $Q$ be an OMQ from $(\mathcal{SHI},\text{UCQ})^{-}$ . Instead of deciding FO- or MDLog-rewritability of $Q$ , we can decide the same problem for the MDDLog program delivered by Theorem 26. The bounds stated in Theorem 26, Lemmas 17 and 18, and Theorems 1 and 1, though, only guarantee that we obtain a CSP template with 3-exponentially many elements, which does not yield 2NExpTime upper bounds. However, it is possible to combine the construction underlying Theorem 26 with those underlying Lemmas 17 and 18 and Theorem 1 to obtain the following.

Lemma 27.

Given an OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ from $(\mathcal{SHI},\text{UCQ})^{-}$ with $\mathcal{T}$ of size $n_{\mathcal{T}}$ and $q$ of size $n_{q}$ , one can construct a simple MDDLog program $\Pi_{Q}$ over an aggregation EDB schema $\mathbf{S}^{\prime}_{E}$ such that

(1) $Q$ is $\mathcal{Q}$ -rewritable iff $\Pi_{Q}$ is $\mathcal{Q}$ -rewritable for every $\mathcal{Q}\in\{\text{FO},\text{UCQ},\text{MDLog}\}$ ;

(2) the size of $\Pi_{Q}$ and the cardinality of $\mathbf{S}^{\prime}_{E}$ are bounded by $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ and the arity of relations in $\mathbf{S}^{\prime}_{E}$ is bounded by $\max\{n_{q},2\}$ ;

(3) the IDB schema of $\Pi_{Q}$ is of size $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$ ,

where $p$ is a polynomial. The construction takes time polynomial in the size of $\Pi_{Q}$ .

A proof is provided in Appendix E. Constructing a CSP template from this refined simple program and applying the decision procedures for rewritability of CSP templates, we obtain the upper bounds stated in Theorem 25.

We remark that it is not possible to extend Theorem 25 to description logics with functional roles or number restrictions since, in such DLs, FO-rewritability of OMQs is undecidable [BtCLW14]. The proof can be adapted to MDLog-rewritability.

The results about the shape of rewritings stated in Theorem 22 (of course) also apply to the OMQ case. Note that, in Points 1 and 2 of that theorem, we can then replace $k$ with $\max\{n_{q},2\}$ . Moreover, the canonical DLog programs introduced for MDDLog in Section 7 can also be utilized for OMQs via the translation underlying the proof of Theorem 26.

Regarding Datalog-rewritability of OMQs, we obtain a potentially incomplete decision procedure by combining Theorem 26 with Lemmas 17 and 18 and the algorithm from Section 6. It is possible to define a class of OMQs $(\mathcal{T},\mathbf{S}_{E},q)$ that have equality and for which this procedure is complete. Roughly, $\mathbf{S}_{E}$ needs to contain a relation ${\mathtt{eq}}$ and $\mathcal{T}$ enforces that for all models $\mathcal{I}$ of $\mathcal{T}$ and all $(d,e)\in{\mathtt{eq}}^{\mathcal{I}}$ , $d$ and $e$ satisfy exactly the same subconcepts of $\mathcal{T}$ and exactly the same tree contractions of $q$ and then taking a subquery. We refrain from working out the details.

9. Dichotomy and Deciding PTime Query Evaluation

There was a recent breakthrough in research on CSPs, independently achieved by Bulatov and by Zhuk, who have proved the long-standing Feder-Vardi conjecture, thus establishing a dichotomy between PTime and NP for the complexity of CSPs [Bul17, Zhu17]. Together with results by Chen and Larose [CL17], this also implies that it is decidable and NP-complete whether the CSP defined by a given template has PTime complexity. We observe that, together with the translations given in this paper, we obtain several interesting results on MMSNP, MDDLog, and OMQs.

In particular, we consider the (data) complexity of query evaluation, which is defined in the expected way. For example, each OMQ $Q=(\mathcal{T},\mathbf{S}_{E},q)$ gives rise to the following query evaluation problem: given an $\mathbf{S}_{E}$ -instance $I$ and a tuple $\mathbf{a}\subseteq\mathsf{dom}(I)$ , decide whether $I\models Q(\mathbf{a})$ . This problem is guaranteed to be in coNP when $Q$ is from any of the OMQ languages studied in this paper [BtCLW14]. Of course, there are also OMQs $Q$ for which it is in PTime or even simpler, and from a practical perspective it is very important to understand the exact complexity of evaluating the concrete queries that are relevant for the application at hand. The definition of query evaluation is analogous for MDDLog and for MMSNP; note that MMSNP only gives rise to Boolean queries and that there is an NP upper bound for the complexity rather than a coNP one.

The question of PTime query evaluation also comes with an associated ‘meta problem’: given an OMQ $Q$ (or a query from some other relevant language), decide whether $Q$ admits PTime query evaluation. We remark that the data complexity of OMQs as well as the associated meta problem and dichotomy questions have received significant interest [BtCLW14, LW17, LS17, HLW17, LSW15, LSW13, CDL+13]. The following theorem summarizes our results regarding the complexity of query evaluation.

Theorem 28.

In MDDLog and all OMQ languages between $(\mathcal{ALC},\text{UCQ})$ and $(\mathcal{SHI},\text{UCQ})^{-}$ , as well as between $(\mathcal{ALCI},\text{CQ})$ and $(\mathcal{SHI},\text{CQ})^{-}$ ,

(1) there is a dichotomy in the complexity of query evaluation between PTime and coNP;

(2) deciding PTime query evaluation is 2NExpTime-complete.

The same holds for MMSNP, with coNP replaced by NP in Point (1).

Proof 9.1.

For MMSNP, it is well-known that there is a dichotomy for query evaluation between PTime and NP iff there is such a dichotomy for the complexity of CSPs. This gives the MMSNP version of (1). In fact, it is even known that the constructions from the proofs of Theorems 1 and 1, which transform an MMSNP sentence into a CSP, preserve complexity up to polynomial time reductions [FV98, Kun13]. We thus obtain the upper bound in the MMSNP version of (2). The lower bound is obtained from the reduction used in [BL16] to show that the Datalog-rewritability of (the complement of) MMSNP sentences is 2NExpTime-hard. The proof is by reduction of a tiling problem. Given such a tiling problem, one constructs an MMSNP sentence $\varphi$ such that $\varphi$ is FO-rewritable if there is a tiling and equivalent to 3-colorability (thus not Datalog-rewritable and NP-hard) otherwise. Clearly, such a reduction also yields 2NExpTime-hardness of PTime query evaluation.

For the cases of MDDLog and for OMQs, lower bounds are obtained along the same lines, that is, by observing that the lower bound constructions from [BL16] are directly applicable. For the upper bounds and the dichotomies, we first observe that the construction in the proof of Lemma 18 preserves complexity up to polynomial time reductions (which is implicit in the proof) and recall that the translation in Lemma 27 is even equivalence-preserving. It thus remains to deal with Lemma 17. There, an $n$-ary MDDLog program $\Pi$ is translated into a family of Boolean MDDLog programs (with constants) $\Pi_{\mathbf{c}}$, $\mathbf{c}\in C^{n}$, where $C$ is a fixed set of $n$ constants. The claim formulated in the proof of Lemma 17 provides

(1) a polynomial time reduction of evaluating $\Pi$ to evaluating the programs $\Pi_{\mathbf{c}}$: given an instance $I$ and an $\mathbf{a}\subseteq{\mathtt{Ind}}(I)$, to decide whether $I\models\Pi(\mathbf{a})$ choose $\mathbf{c}$ with $\mathbf{a}\approx\mathbf{c}$ and check whether $I[\mathbf{c}/\mathbf{a}]\models\Pi_{\mathbf{c}}$;

(2) for each $\mathbf{c}\in C^{n}$, a polynomial time reduction of evaluating $\Pi_{\mathbf{c}}$ to evaluating $\Pi$: given an instance $I$, to decide whether $I\models\Pi_{\mathbf{c}}$ check whether $I\models\Pi(\mathbf{c})$.

It follows that $\Pi$ can be evaluated in PTime iff all of the programs $\Pi_{\mathbf{c}}$ can. This is enough to transfer the upper bound for deciding PTime query evaluation. For dichotomy, we additionally need that if one of the programs $\Pi_{\mathbf{c}}$ is coNP-hard, then so is $\Pi$ , which follows from (2).

10. Discussion

We have clarified the decidability status and computational complexity of FO- and MDLog-rewritability in MMSNP, MDDLog, and various OMQ languages based on expressive description logics and conjunctive queries, and we also made several interesting observations regarding dichotomies and the decidability and complexity of PTime query evaluation. For Datalog-rewritability, we were only able to obtain partial results, namely a sound algorithm that is complete on a certain class of inputs and potentially incomplete in general. This raises several natural questions: is our algorithm actually complete in general? Does an analogue of Lemma 4 (that is, rewritability on instances of high girth implies rewritability) hold for Datalog as a target language? What is the complexity of deciding Datalog-rewritability in the afore-mentioned languages? From an OMQ perspective, it would also be important to work towards more practical approaches for computing (FO-, MDLog-, and DLog-) rewritings. Given the high computational complexities involved, such approaches might have to be incomplete to be practically feasible. However, the degree/nature of incompleteness should then be characterized, and we expect the results in this paper to be helpful in such an endeavour.

Appendix A Translating Boolean MDDLog to generalized CSP

A.1. From MDDLog to Simple MDDLog

Let $\Pi$ be a Boolean MDDLog program over schema $\mathbf{S}_{E}$ and of diameter $k$ . We first construct from $\Pi$ an equivalent Boolean MDDLog program $\Pi_{B}$ such that the following conditions are satisfied:

(i) all rule bodies are biconnected, that is, when any single variable is removed from the body (by deleting all atoms that contain it), then the resulting rule body is still connected;

(ii) if $R(x,\dots,x)$ occurs in a rule body with $R$ EDB, then the body contains no other EDB atoms.

A good way to think about what is achieved in this first step is that, when the resulting program is evaluated on an instance of bounded treewidth, then it suffices to map the rule bodies to individual bags while it is never necessary to cross ‘bag boundaries’.

To construct $\Pi_{B}$ , we first extend $\Pi$ with all contractions of rules in $\Pi$ ; we will refer to this step as the collapsing step. We then split up rules that are not biconnected into multiple rules by exhaustively executing the following rewriting steps:

• replace every rule $p(\mathbf{y})\leftarrow q_{1}(\mathbf{x}_{1})\wedge q_{2}(\mathbf{x}_{2})$ where $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ share exactly one variable $x$ but both contain also other variables with the rules $p_{1}(\mathbf{y}_{1})\vee Q(x)\leftarrow q_{1}(\mathbf{x}_{1})$ and $p_{2}(\mathbf{y}_{2})\leftarrow Q(x)\wedge q_{2}(\mathbf{x}_{2})$, where $Q$ is a fresh monadic IDB relation and $p_{i}(\mathbf{y}_{i})$ is the restriction of $p(\mathbf{y})$ to atoms that are nullary or contain a variable from $\mathbf{x}_{i}$, $i\in\{1,2\}$;

• replace every rule $p(\mathbf{y})\leftarrow q_{1}(\mathbf{x}_{1})\wedge q_{2}(\mathbf{x}_{2})$ where $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ share no variables and are both non-empty with the rules $p_{1}(\mathbf{y}_{1})\vee Q()\leftarrow q_{1}(\mathbf{x}_{1})$ and $p_{2}(\mathbf{y}_{2})\leftarrow Q()\wedge q_{2}(\mathbf{x}_{2})$, where $Q()$ is a fresh nullary IDB relation and the $p_{i}(\mathbf{y}_{i})$ are as above;

• replace every rule $p(\mathbf{y})\leftarrow R(x,\dots,x)\wedge q(\mathbf{x})$ where $R$ is an EDB relation and $q$ contains at least one EDB atom and the variable $x$, with the rules $Q(x)\leftarrow R(x,\dots,x)$ and $p(\mathbf{y})\leftarrow Q(x)\wedge q(\mathbf{x})$, where $Q$ is a fresh monadic IDB relation.

It is easy to see that the resulting program $\Pi_{B}$ is equivalent to the original program $\Pi$ and that $\Pi_{B}$ satisfies Conditions (i) and (ii) above.
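Condition (i) can be tested mechanically. The following is a minimal sketch, not taken from the paper, that assumes a rule body is given as a list of atoms `(relation, args)` and reads "removing a variable" as deleting every atom that contains it while keeping the remaining variables as vertices. The names `count_components` and `is_biconnected` are hypothetical.

```python
def count_components(variables, atoms):
    """Number of connected components when variables are linked by shared atoms."""
    parent = {v: v for v in variables}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for _rel, args in atoms:
        for other in args[1:]:
            parent[find(other)] = find(args[0])
    return len({find(v) for v in variables})


def is_biconnected(atoms):
    """Check Condition (i) under the reading stated in the lead-in (an assumption)."""
    variables = {v for _rel, args in atoms for v in args}
    if count_components(variables, atoms) > 1:
        return False
    for v in variables:
        rest_atoms = [a for a in atoms if v not in a[1]]
        if count_components(variables - {v}, rest_atoms) > 1:
            return False
    return True


triangle = [("r", ("x1", "x2")), ("r", ("x2", "x3")), ("r", ("x3", "x1"))]
path = [("r", ("x", "y")), ("r", ("y", "z"))]
print(is_biconnected(triangle), is_biconnected(path))
```

Under this reading, the path body is rejected because removing $y$ separates $x$ from $z$, which matches the first rewriting step above: its two atoms share exactly one variable and each contains a further variable.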

We next construct from $\Pi_{B}$ the desired simple program $\Pi_{S}$ by replacing, in every rule, the EDB atoms in the rule body with a single EDB atom that represents the conjunction of all atoms replaced. We thus introduce fresh EDB relations that represent conjunctions of old EDB relations. Note that there can be implications between the new EDB relations that we will have to take care of in the construction of $\Pi_{S}$.

Let $\mathcal{Q}_{\Pi}$ denote the set of CQs that can be obtained from a rule body in $\Pi_{B}$ by consistently renaming variables, using only variables that occur in $\Pi_{B}$. Let $\mathbf{S}_{I}$ be the IDB schema of $\Pi_{B}$. For every $q(\mathbf{x})\in\mathcal{Q}_{\Pi}$, we write $q(\mathbf{x})|_{\mathbf{S}_{E}}$ to denote the restriction of $q(\mathbf{x})$ to $\mathbf{S}_{E}$-atoms, and likewise for $q(\mathbf{x})|_{\mathbf{S}_{I}}$ and IDB atoms. The EDB schema $\mathbf{S}^{\prime}_{E}$ of $\Pi_{S}$ consists of the relations $R_{q(\mathbf{x})|_{\mathbf{S}_{E}}}$, $q(\mathbf{x})\in\mathcal{Q}_{\Pi}$, whose arity is the number of variables in $q(\mathbf{x})$ (which, by construction of $\Pi_{B}$, is identical to the number of variables in $q(\mathbf{x})|_{\mathbf{S}_{E}}$). The program $\Pi_{S}$ consists of the following rules:

whenever $p(\mathbf{y})\leftarrow q_{1}(\mathbf{x}_{1})$ is a rule in $\Pi_{B}$ , $q_{2}(\mathbf{x}_{2})\in\mathcal{Q}_{\Pi}$ , and $q_{1}(\mathbf{x}_{1})\subseteq q_{2}(\mathbf{x}_{2})$ , then $\Pi_{S}$ contains the rule $p(\mathbf{y})\leftarrow R_{q_{2}(\mathbf{x}_{2})}(\mathbf{x}_{2})\wedge q_{1}(\mathbf{x}_{1})|_{\mathbf{S}_{I}}$

The case where $q_{1}(\mathbf{x}_{1})$ is identical to $q_{2}(\mathbf{x}_{2})$ corresponds to adapting rules in $\Pi_{B}$ to the new EDB signature and the other cases take care of implications between EDB relations.

Assume that $\Pi_{B}$ contains the following rules, where $A$ and $r$ are EDB relations:

[TABLE]

A new ternary EDB relation $R_{q_{2}}$ is introduced for the EDB body atoms of the lower rule, where $q_{2}=r(x_{1},x_{2})\wedge r(x_{2},x_{3})\wedge r(x_{3},x_{1})$ , and a new ternary EDB relation $R_{q_{1}}$ is introduced for the upper rule, $q_{1}=A(x_{1})\wedge q_{2}$ . In $\Pi_{S}$ , the rules are replaced with

[TABLE]

Note that $q_{2}\subseteq q_{1}$ and thus $q_{1}$ logically implies $q_{2}$ , which results in two copies of the goal rule to be generated.
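The rule duplication in this example can be reproduced with a small sketch. It is an illustration under simplifying assumptions rather than the construction itself: variable renaming is omitted, inclusion is tested on the EDB parts of the bodies (as in the example), IDB body atoms of the two rules are left out, and the head name `upper_head` as well as the function names are hypothetical.

```python
def edb_part(body, edb_relations):
    """Restriction of a rule body to its EDB atoms."""
    return frozenset(atom for atom in body if atom[0] in edb_relations)


def to_simple(rules, edb_relations):
    """rules: list of (head, body); body is a set of atoms (relation, args)."""
    edb_bodies = {edb_part(body, edb_relations) for _head, body in rules}
    simple = []
    for head, body in rules:
        q1 = edb_part(body, edb_relations)
        idb_atoms = frozenset(body) - q1
        for q2 in edb_bodies:
            if q1 <= q2:  # q2 implies q1, so the rule must also fire on R_{q2}
                variables = tuple(sorted({v for _rel, args in q2 for v in args}))
                simple.append((head, ("R", q2, variables), idb_atoms))
    return simple


triangle = frozenset({("r", ("x1", "x2")), ("r", ("x2", "x3")), ("r", ("x3", "x1"))})
rules = [
    ("upper_head", triangle | {("A", ("x1",))}),  # EDB body q1; head name is hypothetical
    ("goal", triangle),                            # EDB body q2
]
simple = to_simple(rules, edb_relations={"A", "r"})
for head, (_name, q, variables), _idb in simple:
    print(head, len(q), variables)
```

The sketch yields one rule for the upper rule (over $R_{q_{1}}$) and two copies of the goal rule (over $R_{q_{2}}$ and $R_{q_{1}}$).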

Proof details for the following lemma can be found in [BL16]. Recall that $k$ is the diameter of the original MDDLog program $\Pi$ .

Lemma 29.

(1) If $I$ is an $\mathbf{S}_{E}$-instance and $I^{\prime}$ the corresponding $\mathbf{S}^{\prime}_{E}$-instance, then $I\models\Pi$ iff $I^{\prime}\models\Pi_{S}$;

(2) If $I^{\prime}$ is an $\mathbf{S}^{\prime}_{E}$-instance and $I$ the corresponding $\mathbf{S}_{E}$-instance, then

(a) $I^{\prime}\models\Pi_{S}$ implies $I\models\Pi$;

(b) $I\models\Pi$ implies $I^{\prime}\models\Pi_{S}$ if the girth of $I^{\prime}$ exceeds $k$.

The following example demonstrates why the restriction to high girth instances in Point 2b of Lemma 29 is necessary, see also Example 3.

Consider the programs $\Pi_{B}$ and $\Pi_{S}$ from Example A.1. Take the $\mathbf{S}^{\prime}_{E}$ -instance $I^{\prime}$ defined by

[TABLE]

It can be verified that $I^{\prime}\not\models\Pi_{S}$ . But the corresponding $\mathbf{S}_{E}$ -instance $I$ is such that $\Pi_{B}$ derives the IDB relation $P$ at $a^{\prime}$ , $b^{\prime}$ , and $c^{\prime}$ , and additionally $I$ contains the facts

[TABLE]

which are not covered by any fact in $I^{\prime}$ . Thus clearly $I\models\Pi_{B}$ .

A.2. From Simple MDDLog to Generalized CSP

Let $\Pi$ be a simple MDDLog program over EDB schema $\mathbf{S}_{E}$ and with IDB schema $\mathbf{S}_{I}$ . For $i\in\{0,1\}$ , an i-type is a set $t$ of relation symbols from $\mathbf{S}_{I}$ of arity at most $i$ that does not contain ${\mathtt{goal}}()$ and that satisfies all rules in $\Pi$ which use only IDB relations of arity at most $i$ and do not involve any EDB relations.
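The definition of $i$-types can be made concrete by brute-force enumeration. The sketch below is only an illustration: it assumes the EDB-free rules are given as pairs `(head, body)` of sets of IDB relation names (variables are dropped, assuming such rules mention at most one variable), and that the caller passes only the IDB symbols of arity at most $i$. The function name `types` and the sample symbols are hypothetical.

```python
from itertools import combinations


def types(idb_symbols, rules, goal="goal"):
    """All sets of IDB symbols without `goal` that satisfy every EDB-free rule."""
    candidates = sorted(s for s in idb_symbols if s != goal)
    result = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            t = frozenset(subset)
            # A rule is satisfied if its body is not contained in t
            # or some head atom belongs to t.
            if all(not body <= t or head & t for head, body in rules):
                result.append(t)
    return result


rules = [
    (frozenset({"P", "Pbar"}), frozenset()),        # P(x) v Pbar(x) <- true(x)
    (frozenset({"goal"}), frozenset({"P", "Pbar"})),  # goal() <- P(x) ^ Pbar(x)
]
print(types({"P", "Pbar", "goal"}, rules))
```

For this toy input the only types are $\{P\}$ and $\{\overline{P}\}$: the empty set violates the first rule, and $\{P,\overline{P}\}$ would force ${\mathtt{goal}}()$, which types must not contain.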

We build a template $T_{\theta}$ for each 0-type $\theta$ . The elements of $T_{\theta}$ are exactly the 1-types that agree with $\theta$ on nullary IDB relations. $T_{\theta}$ consists of the following facts:

(1) $P()$ for each nullary $P\in\theta$;

(2) $P(t)$ for each 1-type $t$ and each monadic $P\in t$;

(3) $R(t_{1},\dots,t_{n})$ for each relation $R\in\mathbf{S}_{E}$ and all 1-types $t_{1},\dots,t_{n}$ such that $\Pi$ does not contain a rule

[TABLE]

such that $P_{j}\in t_{i_{j}}$ for $1\leq j\leq n$ , and $P\notin t_{i}$ .

The following was observed in [FV98].

Lemma 30.

For any $\mathbf{S}_{E}$-instance $I$, we have $I\models\Pi$ iff $I\not\rightarrow T_{\theta}$ for all 0-types $\theta$.
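The right-hand side of Lemma 30 is a finite homomorphism test. As a minimal sketch, assuming instances and templates are given as sets of facts `(relation, args)` and using an exponential brute-force search that is only suitable for tiny inputs, membership in the complement of a generalized CSP can be checked as follows. The function names and the 2-colourability template are illustrative choices, not taken from the paper.

```python
from itertools import product


def maps_to(instance, template):
    """True if there is a homomorphism from `instance` to `template`."""
    dom_i = sorted({e for _rel, args in instance for e in args})
    dom_t = sorted({e for _rel, args in template for e in args})
    for image in product(dom_t, repeat=len(dom_i)):
        h = dict(zip(dom_i, image))
        if all((rel, tuple(h[e] for e in args)) in template for rel, args in instance):
            return True
    return False


def in_cocsp(instance, templates):
    """True if `instance` maps to none of the templates."""
    return not any(maps_to(instance, t) for t in templates)


k2 = {("E", (0, 1)), ("E", (1, 0))}  # template for 2-colourability
triangle = {("E", ("a", "b")), ("E", ("b", "c")), ("E", ("c", "a"))}
path = {("E", ("a", "b")), ("E", ("b", "c"))}
print(in_cocsp(triangle, [k2]), in_cocsp(path, [k2]))
```

In the setting of the lemma, `templates` would be the list of all $T_{\theta}$, and `in_cocsp` returning true corresponds to $I\models\Pi$.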

Appendix B MDLog-Rewritability of Generalized CSP

In the proof of the subsequent theorem, we use obstructions of CSPs, which are defined in Section 5.

Theorem 31.

Given a finite set of templates $S$, it can be decided in ExpTime whether coCSP$(S)$ is MDLog-rewritable.

Proof B.1.

Consider coCSP $(S)$ over schema $\mathbf{S}_{E}$ with $S:=\{T_{1},\dots,T_{n}\}$ . We start with observing that we can assume that the templates in $S$ are mutually homomorphically incomparable: if this is not the case, we remove templates that are not homomorphically minimal and further remove templates so that none of the remaining templates are homomorphically equivalent. Clearly, this is equivalence preserving and can be done in ExpTime.

We aim to show that coCSP $(S)$ is MDLog-rewritable if and only if coCSP $(T_{i})$ is for all $i\in\{1,\dots,n\}$ , which gives the desired ExpTime upper bound. The “if” direction is immediate since the union of MDLog programs is expressible as an MDLog program. For the “only if” direction, assume that coCSP( $S$ ) is MDLog-rewritable, and let $\Gamma$ denote a concrete rewriting. Consider a template $T_{j}$ and let $\mathcal{O}(T_{j})$ denote the set of all finite $\mathbf{S}_{E}$ -instances of treewidth $(1,k)$ that do not homomorphically map to $T_{j}$ where $k$ is the maximum number of variables that occur in a single rule of $\Gamma$ . We will show that $\mathcal{O}(T_{j})$ is an obstruction set for $T_{j}$ . It then follows from Theorem 23 of [FV98], which says that the existence of an obstruction set of treewidth $(1,k)$ for some fixed $k$ implies MDLog-rewritability, that coCSP $(T_{j})$ is MDLog-rewritable.

By definition of $\mathcal{O}(T_{j})$ , it is immediate that if $O\rightarrow I$ for some $O\in\mathcal{O}(T_{j})$ and $\mathbf{S}_{E}$ -instance $I$ , then $I\not\rightarrow T_{j}$ . We now establish the converse. Assume that $I\not\rightarrow T_{j}$ . Consider the disjoint union $U$ of $I$ and $T_{j}$ . Since the templates in $S$ are homomorphically incomparable, $U\not\rightarrow T_{i}$ for all $i\in\{1,\dots,n\}$ . Thus $U\models\Gamma$ and there is a proof tree for ${\mathtt{goal}}()$ from $U$ and $\Gamma$ . From that tree, we can read off an $\mathbf{S}_{E}$ -instance $J$ such that $J\rightarrow U$ , $J$ has treewidth $(1,k)$ , and $J\models\Gamma$ . From the latter, we get $J\not\rightarrow T_{j}$ . There must thus also be a connected component $O$ of $J$ with $O\not\rightarrow T_{j}$ . We clearly have $O\in\mathcal{O}(T_{j})$ . Since $O\rightarrow U$ , $O\not\rightarrow T_{j}$ , and $O$ is connected, we moreover get $O\rightarrow I$ which finishes the proof.

Appendix C Proof of Theorem 22

See Theorem 22.

Proof C.1.

We treat the two cases, FO-rewritability and MDLog-rewritability with parameters, in parallel in a uniform way. To achieve uniformity, recall that FO-rewritability coincides with UCQ-rewritability by Proposition 1 and observe that a UCQ-rewriting of treewidth $(1,k)$ with $n$ parameters can be converted into a non-recursive MDLog-rewriting with $n$ parameters of diameter $k$ and vice versa. We work with the latter.

Assume that an $n$ -ary MDDLog program $\Pi$ over EDB schema $\mathbf{S}_{E}$ is rewritable into (non-recursive) MDLog with $n$ parameters. We can convert

(1) $\Pi$ into Boolean MDDLog programs $\Pi_{1},\dots,\Pi_{k}$ with constants (Lemma 17),

(2) $\Pi_{1},\dots,\Pi_{k}$ into Boolean MDDLog programs $\Pi^{\prime}_{1},\dots,\Pi^{\prime}_{k}$ without constants (Lemma 18),

(3) $\Pi^{\prime}_{1},\dots,\Pi^{\prime}_{k}$ into simple Boolean MDDLog programs $\Pi^{\prime\prime}_{1},\dots,\Pi^{\prime\prime}_{k}$ (Theorem 1), and

(4) $\Pi^{\prime\prime}_{1},\dots,\Pi^{\prime\prime}_{k}$ into CSP templates $T_{1},\dots,T_{k}$ (Theorem 1)

such that all these programs and (complements of) templates are rewritable into (non-recursive) MDLog. Moreover, in the proofs of the mentioned lemmas and theorems, it is shown how to construct (non-recursive) MDLog-rewritings of $\Pi^{\prime\prime}_{1},\dots,\Pi^{\prime\prime}_{k}$ from given ones of $T_{1},\dots,T_{k}$ , for $\Pi^{\prime}_{1},\dots,\Pi^{\prime}_{k}$ from given ones of $\Pi^{\prime\prime}_{1},\dots,\Pi^{\prime\prime}_{k}$ , and so on. We are going to analyze these constructions in more detail.

We first note that for any (non-recursive) MDLog-rewritable CSP, there is a (non-recursive) MDLog-rewriting where every rule body has at most one EDB atom that contains all variables which occur in the rule body. Since each program $\Pi^{\prime\prime}_{i}$ is actually equivalent to the complement of the CSP template $T_{i}$ in Step 4, the same is true for the programs $\Pi^{\prime\prime}_{i}$ . Thus, there is a (non-recursive) MDLog-rewriting $\Gamma^{\prime\prime}_{i}$ of $\Pi^{\prime\prime}_{i}$ in which

($\dagger$) each rule body has at most one EDB atom that contains all variables.

The translation of $\Pi^{\prime}_{i}$ into $\Pi^{\prime\prime}_{i}$ in Step 3 involves replacing the EDB schema $\mathbf{S}_{E}$ with an aggregation schema $\mathbf{S}^{\prime}_{E}$ . More precisely, $\mathbf{S}^{\prime}_{E}$ consists of relations $R_{q(x)}$ where $q(x)$ is obtained from a rule body in $\Pi^{\prime}_{i}$ by first contracting, then splitting up the body into biconnected components, and finally dropping all IDB relations. When translating the rewriting $\Gamma^{\prime\prime}_{i}$ of $\Pi^{\prime\prime}_{i}$ into a rewriting $\Gamma^{\prime}_{i}$ of $\Pi^{\prime}_{i}$ , this change in schema is reverted. By ( $\dagger$ ), the diameter of $\Gamma^{\prime}_{i}$ is thus bounded by the arity of relations in $\Gamma^{\prime\prime}_{i}$ and that arity, in turn, is bounded by the diameter of $\Pi^{\prime}_{i}$ . What’s more important, though, is that we actually know what the rule bodies in $\Gamma^{\prime}_{i}$ look like:

($\ddagger$) every rule body in $\Gamma^{\prime}_{i}$ is obtained from a rule body in $\Pi^{\prime}_{i}$ by first contracting, then splitting up the body into biconnected components, then dropping all IDB relations, and finally decorating with some fresh IDB relations without introducing fresh variables.

Now consider the translation of $\Pi_{i}$ into $\Pi^{\prime}_{i}$ in Step 2 and the corresponding translation of $\Gamma^{\prime}_{i}$ into a rewriting $\Gamma_{i}$ of $\Pi_{i}$ . In the former, we dejoin rule bodies by (sometimes) replacing different occurrences of the same variable $x$ with different variables $x_{1},x_{2}$ and adding the atoms $R_{j}(x_{1})$ and $R_{j}(x_{2})$ for some $j$ , thus increasing the diameter. In the latter, we rejoin the dejoined rules in $\Gamma^{\prime}_{i}$ in the sense that we replace variables $x,y$ with the same constant $c_{j}$ whenever the rule body contains the (EDB) atoms $R_{j}(x)$ and $R_{j}(y)$ . It can be verified that rejoining any rule body of the form ( $\ddagger$ ) results in a rule body whose diameter is bounded by the diameter of $\Pi^{\prime}_{i}$ . This gives the desired result since Step 1 preserves diameter.
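The rejoining step can be pictured with a small sketch. It makes two assumptions that the text above does not spell out: every variable marked by $R_{j}$ is replaced by $c_{j}$ (so two variables carrying the same marker end up identified), and the marker atoms are dropped afterwards. The function name `rejoin` and the dictionary `markers` are hypothetical.

```python
def rejoin(body, markers):
    """body: list of atoms (relation, args); markers: marker relation -> constant."""
    substitution = {}
    for rel, args in body:
        if rel in markers:
            # Monadic marker atom R_j(x): x will be replaced by the constant c_j.
            substitution[args[0]] = markers[rel]
    return [
        (rel, tuple(substitution.get(v, v) for v in args))
        for rel, args in body
        if rel not in markers  # dropping marker atoms is an assumption of this sketch
    ]


dejoined = [("r", ("x1", "z")), ("s", ("x2", "w")), ("R1", ("x1",)), ("R1", ("x2",))]
print(rejoin(dejoined, {"R1": "c1"}))
```

Here the two occurrences $x_{1},x_{2}$ that were separated by dejoining are merged again into the constant `c1`, so the rejoined body has no more variables than the body it originated from.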

Appendix D Proof of Theorem 26

See Theorem 26.

Proof D.1.

Let $Q=(\mathcal{T},\mathbf{S}_{E},q_{0})$ be an OMQ from $(\mathcal{SHI},\text{UCQ})$ and let $n_{q_{0}}$ and $n_{\mathcal{T}}$ be the size of $q_{0}$ and $\mathcal{T}$, respectively. We use ${\mathtt{sub}}(\mathcal{T})$ to denote the set of subconcepts of (concepts occurring in) $\mathcal{T}$. Moreover, let $\Gamma$ be the set of all tree CQs that can be obtained from a CQ in $q_{0}$ by first existentially quantifying all answer variables, then contracting, and then taking a subquery. Every $q\in\Gamma$ can be viewed as an $\mathcal{ALCI}$-concept provided that we additionally choose a root $x$ of the tree. We denote this concept with $C_{q,x}$. For example, the tree CQ $q=\exists x\exists y\exists z\,r(x,y)\wedge A(y)\wedge s(x,z)$ and choice of $x$ as the root yields the $\mathcal{ALCI}$-concept $C_{q,x}=\exists r.A\sqcap\exists s.\top$. Let ${\mathtt{con}}(q_{0})$ be the set of all these concepts $C_{q,x}$ and let $\mathbf{S}_{I}$ be the schema that consists of monadic relation symbols $P_{C}$ and $\overline{P}_{C}$ for each $C\in{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$ and nullary relation symbols $P_{q}$ and $\overline{P}_{q}$ for each $q\in\Gamma$. We are going to construct an MDDLog program $\Pi$ over EDB schema $\mathbf{S}_{E}$ and IDB schema $\mathbf{S}_{I}$ that is equivalent to $Q$.
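The passage from a rooted tree CQ to the concept $C_{q,x}$ can be sketched as a simple recursion. This is an illustration under assumptions: the CQ is given as a list of tuples, `("A", "y")` for a unary atom and `("r", "x", "y")` for a binary one, it is a tree without multi-edges or self-loops, and edges pointing towards the root are rendered with an inverse role written `r⁻`. The function names are hypothetical.

```python
def wrap(concept):
    """Parenthesise conjunctions so they can appear under a quantifier."""
    return f"({concept})" if " ⊓ " in concept else concept


def tree_cq_to_concept(atoms, root, parent=None):
    """Render the tree CQ `atoms`, rooted at `root`, as an ALCI concept string."""
    conjuncts = sorted(a[0] for a in atoms if len(a) == 2 and a[1] == root)
    for atom in atoms:
        if len(atom) != 3:
            continue
        role, source, target = atom
        if source == root and target != parent:
            conjuncts.append(f"∃{role}.{wrap(tree_cq_to_concept(atoms, target, root))}")
        elif target == root and source != parent:
            conjuncts.append(f"∃{role}⁻.{wrap(tree_cq_to_concept(atoms, source, root))}")
    return " ⊓ ".join(conjuncts) if conjuncts else "⊤"


q = [("r", "x", "y"), ("A", "y"), ("s", "x", "z")]
print(tree_cq_to_concept(q, "x"))  # ∃r.A ⊓ ∃s.⊤
```

Choosing a different root for the same CQ gives a different concept; for instance, rooting at $y$ produces a concept that starts with $A$ and continues through the inverse of $r$.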

By a diagram, we mean a conjunction $\delta(\mathbf{x})$ of atoms over the schema $\mathbf{S}_{E}\cup\mathbf{S}_{I}$ . For an interpretation $\mathcal{I}$ , we write $\mathcal{I}\models\delta(\mathbf{x})$ if there is a homomorphism from $\delta(\mathbf{x})$ to $\mathcal{I}$ , that is, a map $h:\mathbf{x}\rightarrow\Delta^{\mathcal{I}}$ such that:

(1) $A(x)\in\delta$ with $A\in\mathbf{S}_{E}$ implies $h(x)\in A^{\mathcal{I}}$;

(2) $r(x,y)\in\delta$ with $r\in\mathbf{S}_{E}$ implies $(h(x),h(y))\in r^{\mathcal{I}}$;

(3) $P_{q}()\in\delta$ implies $\mathcal{I}\models q$ and $\overline{P}_{q}()\in\delta$ implies $\mathcal{I}\not\models q$;

(4) $P_{C}(x)\in\delta$ implies $h(x)\in C^{\mathcal{I}}$ and $\overline{P}_{C}(x)\in\delta$ implies $h(x)\notin C^{\mathcal{I}}$.

We say that $\delta(\mathbf{x})$ is realizable if there is a model $\mathcal{I}$ of $\mathcal{T}$ with $\mathcal{I}\models\delta(\mathbf{x})$. A diagram $\delta(\mathbf{x})$ implies a CQ $q(\mathbf{x}^{\prime})$, with $\mathbf{x}^{\prime}$ a tuple of variables from $\mathbf{x}$, if every homomorphism from $\delta(\mathbf{x})$ to some model $\mathcal{I}$ of $\mathcal{T}$ is also a homomorphism from $q(\mathbf{x}^{\prime})$ to $\mathcal{I}$. The MDDLog program $\Pi$ consists of the following rules:

(1) the rule $P_{q}()\vee\overline{P}_{q}()\leftarrow{\mathtt{true}}(x)$ for each $q\in\Gamma$;

(2) the rule $P_{C}(x)\vee\overline{P}_{C}(x)\leftarrow{\mathtt{true}}(x)$ for each $C\in{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$;

(3) the rule $\bot\leftarrow\delta(x)$ for each non-realizable diagram $\delta(x)$ that contains a single variable $x$ and only atoms of the form $P_{C}(x)$, $C\in{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$;

(4) the rule $\bot\leftarrow\delta(\mathbf{x})$ for each non-realizable connected diagram $\delta(\mathbf{x})$ that contains at most two variables and at most three atoms;

(5) the rule ${\mathtt{goal}}(\mathbf{x}^{\prime})\leftarrow\delta(\mathbf{x})$ for each diagram $\delta(\mathbf{x})$ that implies $q_{0}(\mathbf{x}^{\prime})$, has at most $n_{q_{0}}$ variable occurrences, and uses only relations of the following form: $P_{q}$, $P_{C}$ with $C$ a concept name that occurs in $q_{0}$, and role names from $\mathbf{S}_{E}$ that occur in $q_{0}$.

To understand $\Pi$ , a good first intuition is that rules of type 1 and 2 guess an interpretation $\mathcal{I}$ , rules of type 3 and 4 take care that the independent guesses are consistent with each other, with the facts in $I$ and with the inclusions in the TBox $\mathcal{T}$ , and rules of type 5 ensure that $\Pi$ returns the answers to $q_{0}$ in $\mathcal{I}$ .

However, this description is an oversimplification. Guessing $\mathcal{I}$ is not really possible since $\mathcal{I}$ might have to contain additional domain elements to satisfy existential quantifiers in $\mathcal{T}$ which may be involved in homomorphisms from (a CQ in) $q_{0}$ to $\mathcal{I}$ , but new elements cannot be introduced by MDDLog rules. Instead of introducing new elements, rules of type 1 and 2 thus only guess the tree CQs that are satisfied by those elements. Tree CQs suffice because $\mathcal{SHI}$ has a tree-like model property and since we have disallowed the use of roles in the query that have a transitive subrole. The notion of ‘diagram implies query’ used in the rules of type 5 takes care that the guessed tree CQs are taken into account when looking for homomorphisms from $q_{0}$ to the guessed model. This construction is identical to the one used in the proof of Theorem 1 of [BtCLW13], with two exceptions. First, we use predicates $P_{C}$ and $\overline{P}_{C}$ for every concept $C\in{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$ while the mentioned proof uses a predicate $P_{t}$ for every subset $t\subseteq{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$ . And second, our versions of Rules 3-5 are formulated more carefully. It can be verified that the correctness proof given in [BtCLW13] is not affected by these modifications. The modifications do make a difference regarding the size of $\Pi$ , though, which we analyse next.

It is not hard to see that, for some polynomial $p$, the number of rules of type 1 is bounded by $2^{p(n_{q_{0}})}$, the number of rules of type 2 and of type 4 is bounded by $2^{p(n_{q_{0}}\cdot{\mathtt{log}}n_{\mathcal{T}})}$, the number of rules of type 3 is bounded by $2^{2^{p(n_{q_{0}}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$, and the number of rules of type 5 is bounded by $2^{2^{p(n_{q_{0}})}}$. Consequently, the overall number of rules is bounded by $2^{2^{p(n_{q_{0}}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ and so is the size of $\Pi$. The bounds on the size of the IDB schema and number of rules in $\Pi$ stated in Theorem 26 are easily verified. It remains to argue that the construction can be carried out in double exponential time. It suffices to observe two facts. First, consistency of a given diagram $\delta(\mathbf{x})$ can be decided in ExpTime since the satisfiability of $\mathcal{SHI}$ concepts w.r.t. TBoxes is in ExpTime [Tob01]. And second, for a given diagram $\delta(\mathbf{x})$ and CQ $q(\mathbf{x}^{\prime})$ with $\mathbf{x}^{\prime}$ a tuple of variables from $\mathbf{x}$, it can be decided in time single exponential in the size of $\delta(\mathbf{x})$ and of $\mathcal{T}$ and double exponential in the size of $q(\mathbf{x}^{\prime})$ whether $\delta(\mathbf{x})$ implies $q(\mathbf{x}^{\prime})$. This is a consequence of the fact that, in $\mathcal{SHI}$, given an ABox $\mathcal{A}$ that may contain compound concepts (in place of concept names), a TBox $\mathcal{T}$, a CQ $q(\mathbf{x})$ and a candidate answer $\mathbf{a}$, it can be decided in time single exponential in the size of $\mathcal{A}$ and $\mathcal{T}$ and double exponential in the size of $q$ whether $\mathbf{a}$ is a certain answer to $q$ on $\mathcal{A}$ w.r.t. $\mathcal{T}$ [GLHS08].

Appendix E Proof of Lemma 27

See Lemma 27.

Proof E.1.

We convert $Q$ into an MDDLog program $\Pi_{0}$ as per Theorem 26 and then remove the answer variables according to the constructions in the proofs of Lemmas 17 and 18, which gives programs $\Pi_{1}$ and $\Pi_{2}$. Analyzing the former construction reveals that the number of rules in $\Pi_{1}$ is bounded by $r\cdot a^{a}$, where $r$ is the number of rules in $\Pi_{0}$ and $a$ is its arity. Moreover, the rule size does not increase and neither the IDB schema nor the EDB schema changes. The latter construction produces a program with $r^{\prime}\cdot s^{s}$ rules where $r^{\prime}$ is the number of rules in $\Pi_{1}$ and $s$ is the rule size of $\Pi_{1}$. Moreover, the IDB schema is not changed and the rule size at most doubles. The EDB schema of the new program comprises $a$ fresh monadic relation symbols. It can thus be verified that the obtained Boolean MDDLog program $\Pi_{2}$ still satisfies Conditions 1-3 of Theorem 26 except that $n_{q}$ in the last point has to be replaced by $2n_{q}$. We make this explicit for the reader’s convenience:

(1) the size of $\Pi_{2}$ is bounded by $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$;

(2) the IDB schema of $\Pi_{2}$ is of size at most $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$;

(3) the rule size of $\Pi_{2}$ is bounded by $2n_{q}$.
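As a rough sanity check of the first bound, restricted to the number of rules, the factors $a^{a}$ and $s^{s}$ can be absorbed as follows. This is a sketch under assumptions that are not stated explicitly above: $a\leq n_{q}$ and $s\leq n_{q}$, logarithms are base 2, $p(m)\geq 1$, and ${\mathtt{log}}\,n_{\mathcal{T}}\geq 1$, where $m$ abbreviates $n_{q}\cdot{\mathtt{log}}\,n_{\mathcal{T}}$.

```latex
\begin{align*}
\#\mathrm{rules}(\Pi_{2})
  &\le r \cdot a^{a} \cdot s^{s}
   \le 2^{2^{p(m)}} \cdot n_{q}^{\,2 n_{q}}
   = 2^{\,2^{p(m)} + 2 n_{q} \log n_{q}} \\
  &\le 2^{\,2^{p(m)} + 2^{2 n_{q}}}
   \le 2^{\,2^{p(m) + 2 n_{q}}}
   \le 2^{\,2^{p(m) + 2m}} .
\end{align*}
```

The steps use $2n\log n\leq 2n^{2}\leq 2^{2n}$ for $n\geq 1$ and $2^{u}+2^{v}\leq 2^{u+v}$ for $u,v\geq 1$, so the bound keeps its shape after replacing $p(m)$ by the polynomial $p(m)+2m$.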

We next convert $\Pi_{2}$ into a simple Boolean MDDLog program $\Pi_{Q}$ according to Theorem 1. Let us analyze the construction in detail to understand the size of $\Pi_{Q}$ , of its EDB schema $\mathbf{S}^{\prime}_{E}$ , and of its IDB schema $\mathbf{S}^{\prime}_{I}$ .

The initial variable identification step can be ignored. In fact, we start with at most $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ rules, each of size at most $2n_{q}$ . Thus variable identification results in a factor of $(2n_{q})!$ regarding the program size and rule number, which is absorbed by $2^{2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}}$ , and the other relevant parameters do not change; in particular, the IDB schema remains unchanged.

The next and central step is to make rules biconnected. Given that the rule size is at most $2n_{q}$ , this can split up each rule into at most $2n_{q}$ rules. This is absorbed by the bounds on program size and rule number. However, on first glance it might seem that we end up with a double exponentially large IDB schema since we might have to split up a double exponential number of rules, each time introducing at least one fresh IDB relation. To argue that this is actually not the case, we distinguish rules of type 1-2 and 4-5 from the construction of $\Pi_{1}$ (proof of Theorem 26); note that the constructions in the proofs of Lemmas 17 and 18 modify the rules only in a very mild way and thus for every rule in $\Pi_{2}$ it is still clear which type it has.

We need not worry about rules of Type 1-2 and 4-5 since there are only $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$ many such rules, each of size at most $2n_{q}$, and thus the number of additional IDB relations introduced for making them biconnected is also bounded by $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$. Rules of type 3 in $\Pi_{0}$, on the other hand, are of a very restricted form, namely

[TABLE]

with $C_{1},\dots,C_{n}\in{\mathtt{sub}}(\mathcal{T})\cup{\mathtt{con}}(q_{0})$. These rules are biconnected and thus we are done when $Q$ is Boolean. In the non-Boolean case, rules of the above form lead to the introduction of additional rules in the construction in the proof of Lemma 18. This results in rules in $\Pi_{2}$ that are of the form

[TABLE]

where $R_{a}$ is one of the fresh IDB relations introduced in the mentioned construction. The latter rules have to be split up to be made biconnected. This will result in rules of the form

[TABLE]

Clearly, there are only $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$ many rule bodies of the latter form and thus it suffices to introduce at most the same number of fresh IDB relations $Q_{i}$. Thus, the size of the IDB schema of $\Pi_{Q}$ is bounded by $2^{p(n_{q}\cdot{\mathtt{log}}n_{\mathcal{T}})}$. Also note that, at this point, the rule size has (potentially) decreased and is bounded by $\max\{n_{q},2\}$. This is obvious for rules of Type 1-2 and 4-5, and also for the rules obtained from making rules of Type 3 biconnected, see above.

The last step is the change of EDB schema. It involves no blowups and we thus obtain the bounds stated in Lemma 27. In particular, the arity of relations in $\mathbf{S}^{\prime}_{E}$ is bounded by $\max\{n_{q},2\}$ since it is bounded by the rule size of the program that we had obtained before changing the EDB schema.

Bibliography


[ACKZ09] Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Zakharyaschev. The DL-Lite family and relations. JAIR, 36:1–69, 2009.
[AHV95] Serge Abiteboul, Richard Hull, and Victor Vianu, editors. Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1995.
[AtCKT11] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, 2011.
[Ats08] Albert Atserias. On digraph coloring problems and treewidth duality. Eur. J. Comb., 29(4):796–820, 2008.
[Bar16] Libor Barto. The collapse of the bounded width hierarchy. J. Log. Comput., 26(3):923–943, 2016.
[BBV16] Michael Benedikt, Pierre Bourhis, and Michael Vanden Boom. A step up in expressiveness of decidable fixpoint logics. In Proc. of LICS 2016, pages 817–826. IEEE Computer Society, 2016.
[BCF12] Manuel Bodirsky, Hubie Chen, and Tomás Feder. On the complexity of MMSNP. SIAM J. Discrete Math., 26(1):404–414, 2012.
[BD13] Manuel Bodirsky and Víctor Dalmau. Datalog and constraint satisfaction with infinite templates. J. Comput. Syst. Sci., 79(1):79–100, 2013.
