Are crossing dependencies really scarce?
Ramon Ferrer-i-Cancho, Carlos Gomez-Rodriguez, J.L. Esteban

TL;DR
This paper investigates the scarcity of crossing dependencies in real sentences by comparing actual sentence structures to various baseline models, concluding that crossings are genuinely rare despite the trees' hubiness.
Contribution
It provides a quantitative analysis showing that crossing dependencies are scarce in real sentences and challenges the assumption that this scarcity is due to the hubiness of syntactic trees.
Findings
Crossings are significantly fewer in real sentences than in random models.
Real sentences are close to linear trees despite the potential for many crossings.
The scarcity of crossings is not explained by the hubiness of syntactic trees.
Abstract
The syntactic structure of a sentence can be modelled as a tree, where vertices correspond to words and edges indicate syntactic dependencies. It has been claimed recurrently that the number of edge crossings in real sentences is small. However, a baseline or null hypothesis has been lacking. Here we quantify the amount of crossings of real sentences and compare it to the predictions of a series of baselines. We conclude that crossings are really scarce in real sentences. Their scarcity is unexpected by the hubiness of the trees. Indeed, real sentences are close to linear trees, where the potential number of crossings is maximized.
| Language | Proportion of planar sentences (Stanford dependencies) | | | |
|---|---|---|---|---|
| Arabic | 0.690 | |||
| Basque | 0.932 | |||
| Bengali | 0.944 | |||
| Bulgarian | 0.843 | |||
| Catalan | 0.790 | |||
| Czech | 0.780 | |||
| Danish | 0.734 | |||
| Dutch | 0.654 | |||
| English | 0.787 | |||
| Estonian | 0.974 | |||
| Finnish | 0.891 | |||
| German | 0.684 | |||
| Greek (Anc.) | 0.312 | |||
| Greek (Mod.) | 0.752 | |||
| Hindi | 0.858 | |||
| Hungarian | 0.728 | |||
| Italian | 0.851 | |||
| Japanese | 0.884 | |||
| Latin | 0.505 | |||
| Persian | 0.785 | |||
| Portuguese | 0.751 | |||
| Romanian | 0.953 | |||
| Russian | 0.810 | |||
| Slovak | 0.819 | |||
| Slovenian | 0.775 | |||
| Spanish | 0.794 | |||
| Swedish | 0.824 | |||
| Tamil | 0.979 | |||
| Telugu | 0.986 | |||
| Turkish | 0.945 | |||
| Macro avg | 0.800 |
| Language | Proportion of planar sentences (Prague dependencies) | | | |
|---|---|---|---|---|
| Arabic | 0.945 | |||
| Basque | 0.933 | |||
| Bengali | 0.939 | |||
| Bulgarian | 0.905 | |||
| Catalan | 0.955 | |||
| Czech | 0.785 | |||
| Danish | 0.880 | |||
| Dutch | 0.673 | |||
| English | 0.941 | |||
| Estonian | 0.992 | |||
| Finnish | 0.908 | |||
| German | 0.671 | |||
| Greek (Anc.) | 0.323 | |||
| Greek (Mod.) | 0.867 | |||
| Hindi | 0.769 | |||
| Hungarian | 0.738 | |||
| Italian | 0.959 | |||
| Japanese | 1.000 | |||
| Latin | 0.499 | |||
| Persian | 0.817 | |||
| Portuguese | 0.860 | |||
| Romanian | 1.000 | |||
| Russian | 0.907 | |||
| Slovak | 0.853 | |||
| Slovenian | 0.822 | |||
| Spanish | 0.945 | |||
| Swedish | 0.935 | |||
| Tamil | 0.988 | |||
| Telugu | 0.992 | |||
| Turkish | 0.914 | |||
| Macro avg | 0.857 |
| Language | Proportion of star trees (Stanford dependencies) | | | |
|---|---|---|---|---|
| Arabic | 0.005 | |||
| Basque | 0.044 | |||
| Bengali | 0.163 | |||
| Bulgarian | 0.033 | |||
| Catalan | 0.009 | |||
| Czech | 0.018 | |||
| Danish | 0.032 | |||
| Dutch | 0.033 | |||
| English | 0.011 | |||
| Estonian | 0.237 | |||
| Finnish | 0.026 | |||
| German | 0.034 | |||
| Greek (Anc.) | 0.043 | |||
| Greek (Mod.) | 0.008 | |||
| Hindi | 0.004 | |||
| Hungarian | 0.013 | |||
| Italian | 0.025 | |||
| Japanese | 0.072 | |||
| Latin | 0.028 | |||
| Persian | 0.026 | |||
| Portuguese | 0.019 | |||
| Romanian | 0.056 | |||
| Russian | 0.025 | |||
| Slovak | 0.044 | |||
| Slovenian | 0.052 | |||
| Spanish | 0.012 | |||
| Swedish | 0.013 | |||
| Tamil | 0.008 | |||
| Telugu | 0.305 | |||
| Turkish | 0.078 | |||
| Macro avg | 0.049 |
| Language | Proportion of star trees (Prague dependencies) | | | |
|---|---|---|---|---|
| Arabic | 0.002 | |||
| Basque | 0.043 | |||
| Bengali | 0.168 | |||
| Bulgarian | 0.021 | |||
| Catalan | 0.006 | |||
| Czech | 0.013 | |||
| Danish | 0.018 | |||
| Dutch | 0.019 | |||
| English | 0.007 | |||
| Estonian | 0.236 | |||
| Finnish | 0.021 | |||
| German | 0.030 | |||
| Greek (Anc.) | 0.040 | |||
| Greek (Mod.) | 0.006 | |||
| Hindi | 0.002 | |||
| Hungarian | 0.015 | |||
| Italian | 0.015 | |||
| Japanese | 0.037 | |||
| Latin | 0.024 | |||
| Persian | 0.012 | |||
| Portuguese | 0.009 | |||
| Romanian | 0.042 | |||
| Russian | 0.018 | |||
| Slovak | 0.037 | |||
| Slovenian | 0.041 | |||
| Spanish | 0.008 | |||
| Swedish | 0.011 | |||
| Tamil | 0.007 | |||
| Telugu | 0.342 | |||
| Turkish | 0.056 | |||
| Macro avg | 0.043 |
R. Ferrer-i-Cancho
Complexity & Quantitative Linguistics Lab, LARCA Research Group
Departament de Ciències de la Computació
Universitat Politècnica de Catalunya
Campus Nord, Edifici Omega, Jordi Girona Salgado 1-3
08034 Barcelona, Catalonia (Spain)
C. Gómez-Rodríguez
Universidade da Coruña
FASTPARSE Lab, LyS Research Group
Departamento de Computación
Facultade de Informática, Elviña
15071 A Coruña, Spain
J. L. Esteban
Logic and Programming, LOGPROG Research Group
Departament de Ciències de la Computació
Universitat Politècnica de Catalunya
Campus Nord, Edifici Omega, Jordi Girona Salgado 1-3
08034 Barcelona, Catalonia (Spain)
Keywords: spatial networks, syntactic dependency trees, crossings, baselines.
1 Introduction
Central to network theory is the definition of null models that shed light on the nature and the significance of network properties [1, 2]. A prototypical example of such a property is the clustering coefficient of a network (the average proportion of pairs of neighbours of a vertex that are connected) [3]. It is well known that the clustering coefficient of real networks is typically much larger than that of an Erdős-Rényi graph with the same density of links [3]. In this setup, the clustering coefficient of the Erdős-Rényi graph coincides with the density of links of the real network. As real networks are sparse, that density is a small number, while the clustering coefficient of a real network is typically a large number (a number close to 1). Hence the clustering of real networks is much greater than expected by chance.
As a null hypothesis, the Erdős-Rényi graph involves minimal information from a real network: its number of vertices and its number of links. In an attempt to understand the origins of the properties of real networks, researchers have been defining null hypotheses that are stronger than the Erdős-Rényi graph in the sense that they involve more information from a real network. Perhaps the most popular examples are random graphs with a degree sequence that matches that of the real graph. Various models that differ in how they sample the space of possible graphs have been designed. One is the configuration model or pairing model, which has been very successful from a theoretical perspective [4, 5, 6] but has very limited applicability as a baseline for real networks due to its inefficiency [2]. The configuration model samples uniformly on the space of configurations (pairings of stubs) [2]. Another example is the switching model, which produces a random graph from a given graph while preserving the degree sequence, as in the configuration model [7, 8, 9]. The switching model can be configured to sample uniformly over the space of possible graphs with the same degree sequence [10, 11].
Here we focus on baselines and null hypotheses for a particular kind of network, i.e., the syntactic structure of sentences, where nodes correspond to words and connections indicate syntactic dependencies between elements, e.g., the dependency between the subject of a sentence and the corresponding verb (Fig. 1) [12]. Syntactic dependency networks are typically trees [13, 14, 12] and constitute a particular case of spatial or geographical network [15, 16, 17] in one dimension, the dimension defined by the linear order of the words in the sentence. The specific measure which we aim to compare against null hypotheses is the number of crossings, which we will denote by $C$ in general. Suppose that vertices are arranged sequentially and that $\pi(v)$ is the position of vertex $v$ in the sequence ($\pi(v) = 1$ for the first vertex of the sequence, $\pi(v) = 2$ for the second vertex of the sequence, and so on). Suppose that we have two edges $u \sim v$ and $s \sim t$ such that $\pi(u) < \pi(v)$ and $\pi(s) < \pi(t)$. We say that $u \sim v$ and $s \sim t$ cross if and only if $\pi(u) < \pi(s) < \pi(v) < \pi(t)$ or $\pi(s) < \pi(u) < \pi(t) < \pi(v)$. With this definition one can count $C_{true}$, the observed number of crossings for a given sentence. The top of Fig. 1 shows a planar sentence, i.e., a sentence without crossings ($C_{true} = 0$), whereas the bottom shows an ordering of the same sentence with one dependency crossing ($C_{true} = 1$), involving the dependency between “Yesterday” and “arrived” and the dependency between “woman” and “who”.
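The crossing definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `count_crossings`, the dictionary `position` (vertex to 1-based position) and the toy tree are assumptions introduced here.

```python
from itertools import combinations

def count_crossings(edges, position):
    """Count pairs of edges that cross in a linear arrangement.

    edges: iterable of (u, v) vertex pairs.
    position: dict mapping each vertex to its 1-based position.
    """
    # Represent every edge by the (left, right) positions of its endpoints.
    spans = [tuple(sorted((position[u], position[v]))) for u, v in edges]
    crossings = 0
    for (a, b), (c, d) in combinations(spans, 2):
        # Two edges cross when their spans interleave strictly.
        # Edges sharing a vertex never satisfy either chain of inequalities.
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings

# Hypothetical toy tree on four vertices placed at positions 1..4.
toy_edges = [(1, 3), (2, 4), (3, 4)]
identity = {v: v for v in range(1, 5)}
print(count_crossings(toy_edges, identity))
```

The quadratic loop over edge pairs is enough for sentence-sized trees; faster algorithms exist but are not needed to follow the argument.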
It is well known that crossing dependencies, those that cross each other when drawn above the words of a sentence, are relatively uncommon in natural language [19, 14]. It is widely accepted that the number of crossings of real sentences is small [19, 14, 12, 20, 21, 22, 23]. A challenge for the belief that the number of crossings is really small is that the proportion of sentences of a corpus that are not planar, namely, that have at least one crossing, can be very large. For instance, about 30% of sentences in German and Dutch corpora are not planar (Table 1 of [24]). Another challenge is that the scarcity of crossings is not supported with a baseline or null hypothesis. For instance, a star tree (a tree where all connections are formed with a hub vertex as on top of Fig. 2) cannot have crossings [25]. Therefore, reaching the theoretical minimum number of crossings does not suffice to conclude that the number of crossings is smaller than expected by chance: for a star tree, it could not be otherwise. In other words, the number of crossings of a star tree is really small (it is minimum) but not scarce with respect to all possible linear orderings of its vertices.
In this article, we aim to quantify the actual number of crossings of real sentences and to clarify the issue of the presumable scarcity of crossing dependencies. More specifically, we will calculate the actual number of crossings in large collections of sentences and compare them against the predictions of baselines and null hypotheses that vary in the amount of information that they involve about a real tree, as it happens with null hypotheses for real networks.
The remainder of the article is organized as follows. Section 2 presents a series of baselines that will be used to assess if the actual number of crossings in sentences is really scarce. Some baselines are borrowed from previous research [26, 27] while others are introduced here. It also presents a measure of hubiness (a normalized measure of the similarity between a dependency tree and a star tree) and shows the relationship between that measure and the potential number of crossings of a tree. Section 3 presents the collections of dependency networks from different languages that will be used in Section 4 to compare the actual number of crossings of real dependency trees against the random baselines of Section 2. Section 4 also analyzes the degree of hubiness of real dependency trees. Section 5 discusses the results.
2 Baselines for the number of crossings
2.1 Absolute baselines for the number of crossings
Star trees and linear trees are crucial to understand the limits of the variation of the number of crossings $C$. A star tree is a tree where a vertex has maximum degree (namely $n - 1$, and thus all other vertices have degree 1) [25]. A linear tree is a tree where vertex degrees do not exceed 2 (and therefore all vertices have degree 2 except two that have degree 1) [25]. See examples of star and linear trees in Fig. 2.
When looking for a reference for the actual number of crossings of a sentence, a first step is to calculate the potential number of crossings. In a syntactic dependency tree of $n$ nodes, the number of edges is $n - 1$ and therefore the total number of crossings cannot exceed

$$\binom{n-1}{2}. \qquad (1)$$
However, this is a rough estimate as edges that share a vertex cannot cross. Taking this fact into account, one may define $Q$, the set of pairs of edges that may potentially cross. $|Q|$, the cardinality of this set, depends on $n$ and $\langle k^2 \rangle$, the second moment of degree about zero, defined as

$$\langle k^2 \rangle = \frac{1}{n} \sum_{i=1}^{n} k_i^2, \qquad (2)$$

with $k_i$ being the degree of the $i$-th vertex of the network. In particular, one has [25]

$$|Q| = \frac{n}{2} \left( \langle k^2 \rangle_{star} - \langle k^2 \rangle \right), \qquad (3)$$

where

$$\langle k^2 \rangle_{star} = n - 1 \qquad (4)$$

is the value of $\langle k^2 \rangle$ for a star tree of $n$ vertices [25]. Indeed, $|Q|$ reaches extreme values in star trees and linear trees. An overview of the arguments follows (see [27] and the Appendix of [28] for further mathematical details).
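As a sanity check of the relationship between the potential number of crossings and the degree second moment, the sketch below compares a closed form with a brute-force count of edge pairs that share no vertex. It assumes the closed form $|Q| = \frac{n}{2}(n - 1 - \langle k^2 \rangle)$, which follows from subtracting the pairs of edges that share a vertex from all pairs of edges; the function names and the two toy trees are illustrative only.

```python
from collections import Counter
from itertools import combinations

def potential_crossings(edges, n):
    """|Q| from the degree sequence, using integer arithmetic only.

    Assumes |Q| = (n/2) * ((n - 1) - <k^2>) = (n*(n - 1) - sum_i k_i^2) / 2.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    sum_of_squares = sum(k * k for k in degree.values())
    return (n * (n - 1) - sum_of_squares) // 2

def potential_crossings_brute(edges):
    """Count pairs of edges that do not share a vertex."""
    return sum(1 for e, f in combinations(edges, 2) if not set(e) & set(f))

star5 = [(1, 2), (1, 3), (1, 4), (1, 5)]   # hub is vertex 1
path5 = [(1, 2), (2, 3), (3, 4), (4, 5)]   # linear tree
for name, tree in (("star", star5), ("linear", path5)):
    print(name, potential_crossings(tree, 5), potential_crossings_brute(tree))
```

Both computations agree on these two trees: no pair of edges can cross in the star, while the linear tree of five vertices has three pairs of edges that may cross.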
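The variation of $\langle k^2 \rangle$ obeys

$$\langle k^2 \rangle_{linear} \le \langle k^2 \rangle \le \langle k^2 \rangle_{star}, \qquad (5)$$

where $\langle k^2 \rangle_{linear}$ and $\langle k^2 \rangle_{star}$ are the value of $\langle k^2 \rangle$ in a linear tree and a star tree, respectively. Obviously, the minimum $|Q|$, namely $|Q| = 0$, is achieved by a star tree because $\langle k^2 \rangle = \langle k^2 \rangle_{star}$ in that case. The maximum value of $|Q|$ is achieved by a linear tree because that tree yields the minimum value of $\langle k^2 \rangle$, namely

$$\langle k^2 \rangle_{linear} = 4 - \frac{6}{n} \qquad (6)$$

(when $n \ge 2$). Notice that

$$\langle k^2 \rangle_{star} - \langle k^2 \rangle_{linear} = n - 5 + \frac{6}{n}. \qquad (7)$$

Therefore, the maximum value of $|Q|$ is

$$|Q|_{linear} = \frac{n}{2} \left( n - 5 + \frac{6}{n} \right) \qquad (8)$$

for $n \ge 2$ ($|Q|_{linear} = 0$ for $n < 2$).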
Obviously, $C = 0$ for any linear arrangement of the vertices of a star tree. We wish to check if there are orderings of the vertices of a linear tree where actually $C = |Q|_{linear}$. Suppose that the vertices of a linear tree are labelled from $1$ to $n$ following a depth-first traversal from one of the leaves. This is equivalent to labelling vertices according to their position in a minimum linear arrangement [29]. Fig. 3 shows linear arrangements of small linear trees with maximum $C$, namely $C = |Q|_{linear}$. These arrangements can be built by placing all vertices with odd labels in ascending order followed by all vertices with even labels in ascending order. By symmetry, it is possible to build other arrangements where $C = |Q|_{linear}$, e.g., placing all vertices with odd labels in descending order followed by all vertices with even labels in descending order. Appendix A presents arrangements that reach $|Q|_{linear}$ for a linear tree of an arbitrary size. As

$$|Q|_{linear} = \frac{(n-2)(n-3)}{2} = \binom{n-2}{2} \le \binom{n-1}{2} \qquad (9)$$

(recall Eq. 8), it is easy to see how crossing theory replaces the naive upper bound of $C$ in Eq. 1 with a tight one.
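The odd-then-even construction can be checked numerically. The sketch below builds that arrangement for a linear tree whose vertices are labelled 1 to n along the path and counts its crossings; it assumes that the maximum potential number of crossings of a linear tree is (n - 2)(n - 3)/2, i.e., the number of pairs of edges of the path that share no vertex. Function names are illustrative.

```python
from itertools import combinations

def odd_even_arrangement(n):
    """Positions for path vertices 1..n: odd labels first, then even labels,
    both in ascending order."""
    order = list(range(1, n + 1, 2)) + list(range(2, n + 1, 2))
    return {v: pos for pos, v in enumerate(order, start=1)}

def crossings(edges, position):
    """Number of pairs of edges whose spans interleave strictly."""
    spans = [tuple(sorted((position[u], position[v]))) for u, v in edges]
    return sum(1 for (a, b), (c, d) in combinations(spans, 2)
               if a < c < b < d or c < a < d < b)

def path_edges(n):
    """Edges of a linear tree labelled 1..n along the path."""
    return [(i, i + 1) for i in range(1, n)]

for n in range(4, 11):
    observed = crossings(path_edges(n), odd_even_arrangement(n))
    maximum = (n - 2) * (n - 3) // 2
    print(n, observed, maximum)
```

For every size in the loop the two printed counts coincide, which is what one expects: each edge of the path joins one vertex in the odd block to one in the even block, so any two edges that do not share a vertex interleave.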
Given the results above, the actual number of crossings of a star tree is not surprising at all according to $|Q|$. In contrast, achieving a low number of crossings in a linear tree is unexpected if the tree is sufficiently large.
2.2 A hubiness coefficient
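We have seen above that $\langle k^2 \rangle$, the second moment of degree about zero, is a fundamental structural property of a tree: it determines $|Q|$, the potential number of crossings. Just as $\langle k^2 \rangle$ determines the range of variation of $C$, it also determines the solution to the minimum linear arrangement problem: the solution is minimum for linear trees and maximum for star trees [29]. $\langle k^2 \rangle$ is a measure of the hubiness of a tree [25] and we have seen that its range of variation is

$$\langle k^2 \rangle_{linear} \le \langle k^2 \rangle \le \langle k^2 \rangle_{star}, \qquad (10)$$

where $\langle k^2 \rangle_{linear}$ and $\langle k^2 \rangle_{star}$ are the value of $\langle k^2 \rangle$ in a linear tree and a star tree respectively. The latter allows one to define a hubiness coefficient as

$$h = \frac{\langle k^2 \rangle - \langle k^2 \rangle_{linear}}{\langle k^2 \rangle_{star} - \langle k^2 \rangle_{linear}} \qquad (11)$$

for $n \ge 4$ (for $n < 4$, the only trees that can be formed are both linear and star trees). It is easy to show that $0 \le h \le 1$. On the one hand, the fact that

$$\langle k^2 \rangle \ge \langle k^2 \rangle_{linear} \qquad (12)$$

gives

$$h \ge 0. \qquad (13)$$

On the other hand, the fact that

$$\langle k^2 \rangle \le \langle k^2 \rangle_{star} \qquad (14)$$

gives

$$h \le 1. \qquad (15)$$

Therefore, $h$ measures the similarity between a tree and a star tree (or the dissimilarity with respect to a linear tree) from the perspective of $\langle k^2 \rangle$. Applying Eqs. 4 and 6 to Eq. 11, one obtains

$$h = \frac{\langle k^2 \rangle - 4 + 6/n}{n - 5 + 6/n}. \qquad (16)$$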
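Note that $h$ is a normalized degree variance. To see it, recall that the degree variance is

$$V[k] = \langle k^2 \rangle - \langle k \rangle^2 \qquad (17)$$

and that $\langle k \rangle$ is fully determined by $n$ because $\langle k \rangle = 2 - 2/n$ [30] for any tree of $n$ vertices. Therefore,

$$\langle k^2 \rangle = V[k] + \left( 2 - \frac{2}{n} \right)^2 \qquad (18)$$

and

$$h = \frac{V[k] - V[k]_{linear}}{V[k]_{star} - V[k]_{linear}}, \qquad (19)$$

where $V[k]_{linear}$ and $V[k]_{star}$ are the degree variance of a linear tree and a star tree of the same size, is 1 when the degree variance is maximum and 0 when variance is minimum.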
It is also easy to show that $h$ is the complement of the normalized potential number of crossings, i.e.,

$$h = 1 - \frac{|Q|}{|Q|_{linear}}. \qquad (20)$$

Therefore, $h$ is 1 when the potential number of crossings is minimum and 0 when it is maximum. Applying the definition of $|Q|$ in Eq. 3 and $|Q|_{linear}$ in Eq. 8 to

$$1 - \frac{|Q|}{|Q|_{linear}} \qquad (21)$$

one recovers Eq. 16 after some algebra.
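The two views of the hubiness coefficient can be compared numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes the coefficient to be the degree second moment rescaled between its linear-tree value (4 - 6/n) and its star-tree value (n - 1), and the potential number of crossings to be (n/2)(n - 1 - second moment). Exact rational arithmetic avoids rounding issues; function names and degree sequences are hypothetical.

```python
from fractions import Fraction

def second_moment(degrees):
    """Second moment of degree about zero, as an exact fraction."""
    return Fraction(sum(k * k for k in degrees), len(degrees))

def hubiness(degrees):
    """Degree second moment rescaled to [0, 1] between linear and star trees."""
    n = len(degrees)
    if n < 4:
        raise ValueError("for n < 4 every tree is both linear and star")
    k2_linear = Fraction(4 * n - 6, n)   # 4 - 6/n
    k2_star = Fraction(n - 1)
    return (second_moment(degrees) - k2_linear) / (k2_star - k2_linear)

def one_minus_relative_potential(degrees):
    """1 - |Q| / |Q|_linear, assuming |Q| = (n/2) * (n - 1 - <k^2>)."""
    n = len(degrees)
    q = Fraction(n, 2) * (n - 1 - second_moment(degrees))
    q_linear = Fraction((n - 2) * (n - 3), 2)
    return 1 - q / q_linear

star = [5, 1, 1, 1, 1, 1]      # degree sequence of a star tree, n = 6
linear = [1, 2, 2, 2, 2, 1]    # degree sequence of a linear tree, n = 6
mixed = [3, 2, 1, 1, 1, 2]     # some other tree on 6 vertices
for degrees in (star, linear, mixed):
    print(hubiness(degrees), one_minus_relative_potential(degrees))
```

The two columns agree for each degree sequence, with the star tree at one extreme and the linear tree at the other.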
2.3 Random baselines for the number of crossings
We consider $C$, the number of crossings of a sentence, in a uniformly random linear arrangement (URLA) of its elements. In this baseline, the expected number of crossings is [26, 27]

$$E_{URLA}[C] = \frac{|Q|}{3}. \qquad (22)$$
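Another baseline can be obtained assuming that the tree is a uniformly random labelled tree (URLT). This choice improves on previous research where random trees that deviate from a uniform distribution were used [20] as a control for $C_{true}$, the observed number of crossings.
The expected value of $\langle k^2 \rangle$ in a URLT is [26]

$$E_{URLT}[\langle k^2 \rangle] = \left( 1 - \frac{1}{n} \right) \left( 5 - \frac{6}{n} \right) \qquad (23)$$

and then the expected number of crossings in a URLA of a URLT is

$$E_{URLT}[C] = \frac{n}{6} \left( n - 1 - E_{URLT}[\langle k^2 \rangle] \right) = \frac{n-1}{6} \left( n - 5 + \frac{6}{n} \right). \qquad (25)$$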
Notice that $E_{URLT}[C]$ is related to the unrestricted baseline $|Q|_{linear}$ introduced above. Combining Eqs. 8 and 25, one obtains

$$E_{URLT}[C] = \frac{n-1}{3n} |Q|_{linear} \approx \frac{|Q|_{linear}}{3} \qquad (27)$$

for sufficiently large $n$. This implies that the expected number of crossings in a random linear arrangement of a URLT is very close to the expected number of crossings in a URLA of a linear tree.
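The random linear arrangement baseline can be verified exhaustively on a small tree. The sketch below averages the number of crossings over all orderings of the vertices and compares the result with one third of the number of edge pairs that share no vertex, which is the value the baseline is assumed to take here. The tree and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations

def crossings(edges, position):
    """Number of pairs of edges whose spans interleave strictly."""
    spans = [tuple(sorted((position[u], position[v]))) for u, v in edges]
    return sum(1 for (a, b), (c, d) in combinations(spans, 2)
               if a < c < b < d or c < a < d < b)

def expected_crossings_urla(edges, n):
    """Exact mean of C over all n! linear arrangements of vertices 1..n."""
    total = 0
    count = 0
    for order in permutations(range(1, n + 1)):
        position = {v: pos for pos, v in enumerate(order, start=1)}
        total += crossings(edges, position)
        count += 1
    return Fraction(total, count)

def independent_pairs(edges):
    """Pairs of edges that share no vertex (the pairs that may cross)."""
    return sum(1 for e, f in combinations(edges, 2) if not set(e) & set(f))

tree = [(1, 2), (1, 3), (3, 4), (4, 5)]   # hypothetical tree on 5 vertices
print(expected_crossings_urla(tree, 5), Fraction(independent_pairs(tree), 3))
```

Exhaustive enumeration is only feasible for very small n; it is used here because it gives an exact value instead of a Monte Carlo estimate.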
2.4 Random baselines for the hubiness coefficient
Recalling Eqs. 7 and 6, it is easy to see that the expected value of the hubiness coefficient in a uniformly random labelled tree is

$$E_{URLT}[h] = \frac{E_{URLT}[\langle k^2 \rangle] - 4 + 6/n}{n - 5 + 6/n}. \qquad (28)$$

Recalling Eq. 23 and noting that

$$\left( 1 - \frac{1}{n} \right) \left( 5 - \frac{6}{n} \right) - 4 + \frac{6}{n} = \frac{1}{n} \left( n - 5 + \frac{6}{n} \right) \qquad (29)$$

we finally obtain

$$E_{URLT}[h] = \frac{1}{n}. \qquad (30)$$

The latter implies that, as $n$ tends to infinity, the expected hubiness of URLTs vanishes while the similarity between URLTs and linear trees is maximized. Linear trees swallow practically all probability mass, in agreement with the finding that the expected number of crossings in a URLA of a URLT tends to that of a linear tree as $n$ tends to infinity (Eq. 27). Furthermore, the finding that $E_{URLT}[h] = 1/n$ suggests that the harmonic mean (or its inverse) could be used to evaluate the hubiness of real sentences with respect to URLTs.
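The expectations over uniformly random labelled trees can be checked by brute force for small sizes. The sketch below relies on two standard facts: Prüfer codes of length n - 2 are in one-to-one correspondence with labelled trees of n vertices, and the degree of a vertex is one plus the number of times it appears in the code. It assumes that the hubiness coefficient is the degree second moment rescaled between 4 - 6/n and n - 1; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def urlt_expectations(n):
    """Exact mean of <k^2> and of the hubiness coefficient over all
    n^(n-2) labelled trees of n vertices (n >= 4)."""
    k2_linear = Fraction(4 * n - 6, n)   # 4 - 6/n
    k2_star = Fraction(n - 1)
    total_k2 = Fraction(0)
    total_h = Fraction(0)
    count = 0
    for code in product(range(n), repeat=n - 2):
        # Degree of a vertex = 1 + number of occurrences in the Pruefer code.
        degrees = [1 + code.count(v) for v in range(n)]
        k2 = Fraction(sum(k * k for k in degrees), n)
        total_k2 += k2
        total_h += (k2 - k2_linear) / (k2_star - k2_linear)
        count += 1
    return total_k2 / count, total_h / count

for n in (4, 5, 6):
    mean_k2, mean_h = urlt_expectations(n)
    closed_form = (1 - Fraction(1, n)) * (5 - Fraction(6, n))
    print(n, mean_k2, closed_form, mean_h)
```

The enumeration grows as n^(n-2), so it is only practical for a handful of vertices, which is enough to compare the printed means with the closed forms.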
2.5 Network theory revisited
In Section 1, we have reviewed various null hypotheses that are used in network theory. The Erdős-Rényi model takes the number of vertices and the number of links of a real network and discards the structure of the real network. The configuration or pairing model and the switching model go a step further incorporating the degree distribution. Our baselines and null hypotheses also parallel this increasing amount of information about the real network that they incorporate.
Recall the two kinds of upper bounds for $C$ in Section 2. If we consider the structure of the tree under consideration irrelevant (e.g., $\langle k^2 \rangle$), the upper bound is $|Q|_{linear}$ (Eq. 8), the maximum value that $|Q|$ can achieve. This bound parallels the Erdős-Rényi model (notice that in a tree the number of edges is $n - 1$ and thus not relevant). If we consider the tree structure relevant, then the upper bound is $|Q|$ (Eq. 3) with $\langle k^2 \rangle$ calculated on the tree under consideration. This bound parallels the configuration or pairing model and the switching model: it involves the degree sequence but knowing $\langle k^2 \rangle$ suffices.
Recall also the two kinds of random baselines for $C$ in Section 2.3. Neglecting the structure of the tree under consideration (e.g., $\langle k^2 \rangle$), a potential baseline is $E_{URLT}[C]$ (Eq. 25), the expected number of crossings in a uniformly random tree. This null hypothesis parallels the Erdős-Rényi model. Conditioning on the tree structure, the potential baseline is $E_{URLA}[C]$ (Eq. 3) with $\langle k^2 \rangle$ taken from the tree under consideration. This null hypothesis parallels the configuration or pairing model and the switching model because it involves the degree sequence or a function of it.
Our hubiness coefficient is a normalized $\langle k^2 \rangle$ and we have seen that $\langle k^2 \rangle$ plays a fundamental role in trees: its extremal values determine the limits of the variation of $|Q|$ and thus also the expected number of crossings in a random linear arrangement of a tree. These values also determine the limits of the variation of $D_{min}$, the minimum sum of edge lengths in a linear arrangement of a given tree ($D_{min}$ is minimum for linear trees and maximum for star trees) [29]. Such a role is reminiscent of the role played by $\langle k^2 \rangle$ in large complex networks concerning, for instance, the spread of epidemics on a network (e.g. a virus on the Internet): if $\langle k^2 \rangle$ diverges the epidemic cannot be stopped [31]. As $\langle k \rangle = 2 - 2/n$ in trees [30], our work extends the importance of $\langle k^2 \rangle$ to the domain of trees.
3 Materials and methods
We aim to compare $C_{true}$, the observed number of crossings, against the different baselines with the help of dependency treebanks. A dependency treebank is a collection, or corpus, of sentences where a dependency graph is provided for every sentence. Our treebanks come from version 2.0 of the HamleDT collection of treebanks [32, 33]. This collection harmonizes previously existing treebanks for 30 different languages into two widely-used annotation guidelines: Universal Stanford dependencies [34] and Prague dependencies [35]. Therefore, this resource allows us to evaluate the baselines not only across a wide range of languages of different families, but also across two well-known annotation schemes. This is useful because observations like the number of dependency crossings in a sentence depend not only on the language but also on annotation criteria (see [28] for a review of some examples of how the number of crossings can be affected by annotation criteria).
As preprocessing, we removed nodes corresponding to punctuation from the analyses in the treebanks, following common practice in research related to statistical properties of dependency structures (e.g. [36, 37]), which is only concerned with dependency relations between actual words. Null elements, which are present in the Bengali, Hindi and Telugu corpora, were also removed as they do not correspond to words. To preserve the structure of the rest of the tree after removing these nodes, non-deleted nodes that had a deleted node as their head were reattached as dependents of their nearest non-removed ancestor. The size of the tree that is obtained corresponds to the length of the sentence in words.
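The reattachment step can be sketched as follows. This is an illustration under assumptions, not the authors' preprocessing code: the tree is represented as a hypothetical dictionary `heads` mapping each node to its head, with 0 standing for the artificial root, and `removed` is the set of punctuation or null nodes to delete.

```python
def remove_nodes(heads, removed):
    """Delete the nodes in `removed` and reattach orphaned dependents.

    heads: dict mapping each node to its head (0 marks the root's head).
    removed: set of nodes to delete.
    Returns a new head dictionary over the surviving nodes, where every node
    whose head was deleted hangs from its nearest surviving ancestor.
    """
    new_heads = {}
    for node, head in heads.items():
        if node in removed:
            continue
        # Climb towards the root until a surviving ancestor is found.
        while head != 0 and head in removed:
            head = heads[head]
        new_heads[node] = head
    return new_heads

# Hypothetical 5-node sentence: 2 is the root; 3 and 4 are to be removed.
heads = {1: 2, 2: 0, 3: 2, 4: 3, 5: 4}
print(remove_nodes(heads, {3, 4}))
```

In this sketch the surviving nodes keep their original identifiers, so their relative order in the sentence is preserved. If the deleted node is the root itself, its dependents end up attached to 0, a case that would need separate handling.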
After this preprocessing, we included in our analyses those syntactic dependency structures that (1) defined a tree with at least 4 nodes, and such that (2) the tree was not a star tree. The reason for (1) is that our baselines assume a tree structure [27, 26] and that we wished to avoid the statistical problem of mixing trees with other kinds of graphs, e.g., the potential number of crossings depends on the number of edges [25, 38, 27]. We focus on trees of at least 4 nodes because for $n < 4$, the number of crossings is always zero. The reason for (2) is that a star tree cannot have crossing dependencies [25]: ratios with the number of crossings in the numerator and the potential number of crossings $|Q|$ in the denominator, e.g., the relative number of crossings [26], are not defined because $|Q| = 0$. Tables 5 and 6 show the proportion of trees that are star trees (this proportion is calculated after applying condition (1)). On average, this proportion is smaller than 5%.
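As star trees are excluded, the random baselines on uniformly random trees must be adapted (see Appendix B for further details). $E_{URLT}[C]$, the expected number of crossings in a URLA of a URLT, is replaced by the same expectation conditioning on the fact that star trees are excluded. Since $n$ of the $n^{n-2}$ labelled trees of $n \ge 3$ vertices are star trees and star trees have no crossings, this gives

$$E_{URLT}[C \mid \text{not star}] = \frac{E_{URLT}[C]}{1 - n^{3-n}}. \qquad (31)$$

It is easy to see that

$$E_{URLT}[C \mid \text{not star}] \approx E_{URLT}[C] \qquad (32)$$

for sufficiently large $n$ (compare Eqs. 25 and 31).
The same applies to $E_{URLT}[h]$, the expected hubiness of a URLT, which has to be replaced by

$$E_{URLT}[h \mid \text{not star}] = \frac{E_{URLT}[h] - n^{3-n}}{1 - n^{3-n}}. \qquad (33)$$

It is easy to see that

$$E_{URLT}[h \mid \text{not star}] \approx E_{URLT}[h] \qquad (34)$$

for sufficiently large $n$.
The corrected versions of the random baselines are expected to matter especially in treebanks with a sufficient concentration of sentences near $n = 4$.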
To assess if the number of crossings in our dataset is significantly small, we conducted two Monte Carlo tests for each treebank, corresponding to each of the two random models of trees. In the first test, we evaluated the significance of the observed number of crossings of each treebank with respect to URLTs, by generating randomized versions of the corpora where each tree is replaced by a URLT with the same number of nodes. To generate each URLT, we produced a uniformly distributed Prüfer code [39] and then converted it to a tree (as an implementation with the Aldous-Broder algorithm [40, 41] proved too slow). The Monte Carlo procedure is used to estimate the left $p$-value, the probability that a randomized corpus yields a number of crossings that is at least as small as the original one. One concludes that the number of crossings is significantly small if the left $p$-value is small enough. In the second test, we evaluated the significance of the number of crossings with respect to URLAs of the trees in the treebank, by generating randomized versions of the treebanks where each syntactic tree is replaced by a URLA of itself. The left $p$-value is estimated as in the first test. Each test is based on randomizations of the treebank. Notice that these tests preserve the distribution of tree sizes of the original treebank, which is required to evaluate the significance of a measurement over a whole treebank accurately [42].
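The two Monte Carlo tests can be sketched as follows. This is a simplified illustration, not the authors' implementation: the corpus statistic is assumed to be the mean number of crossings per sentence, the labels of a random tree are assumed to double as word positions, and the corpus, the number of runs and all function names are hypothetical.

```python
import heapq
import random
from itertools import combinations

def prufer_to_edges(code):
    """Decode a Pruefer code over labels 0..n-1 into the edges of a tree."""
    n = len(code) + 2
    degree = [1] * n
    for v in code:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in code:
        edges.append((heapq.heappop(leaves), v))  # attach smallest leaf to v
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def crossings(edges, position):
    spans = [tuple(sorted((position[u], position[v]))) for u, v in edges]
    return sum(1 for (a, b), (c, d) in combinations(spans, 2)
               if a < c < b < d or c < a < d < b)

def mean_crossings(corpus):
    return sum(crossings(e, p) for e, p in corpus) / len(corpus)

def urla_randomization(corpus, rng):
    """Replace every sentence by a uniformly random arrangement of itself."""
    randomized = []
    for edges, position in corpus:
        order = list(position)
        rng.shuffle(order)
        randomized.append((edges, {v: i for i, v in enumerate(order)}))
    return randomized

def urlt_randomization(corpus, rng):
    """Replace every tree by a uniformly random labelled tree of equal size.
    Assumption: vertex labels are used as positions."""
    randomized = []
    for _, position in corpus:
        n = len(position)
        code = [rng.randrange(n) for _ in range(n - 2)]
        randomized.append((prufer_to_edges(code), {v: v for v in range(n)}))
    return randomized

def left_p_value(corpus, randomize, runs, seed=0):
    """Share of randomized corpora with a statistic <= the observed one."""
    rng = random.Random(seed)
    observed = mean_crossings(corpus)
    hits = sum(mean_crossings(randomize(corpus, rng)) <= observed
               for _ in range(runs))
    return hits / runs

# Hypothetical toy corpus: three planar linear trees in their natural order.
corpus = [([(i, i + 1) for i in range(n - 1)], {v: v for v in range(n)})
          for n in (6, 7, 8)]
print(left_p_value(corpus, urla_randomization, runs=200))
print(left_p_value(corpus, urlt_randomization, runs=200))
```

Both randomizations keep the size of every sentence, so the distribution of tree sizes of the corpus is preserved in each run.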
We also performed the same pair of tests to evaluate the significance of the proportion of planar sentences of a treebank. To evaluate the significance of the hubiness of a treebank, we used URLTs as in the first test to estimate the left $p$-value and also the right $p$-value (the latter being the probability that the randomized corpus yields a value of hubiness that is at least as large as the original one).
4 Results
The claim that dependency crossings are scarce in real sentences can be evaluated with at least two statistics. The first is the proportion of planar sentences (sentences without crossings). This proportion tends to decrease as the sentence length $n$ increases on average for all treebanks (Fig. 4). A detailed analysis over all tree sizes shows that the proportion varies substantially across treebanks (Tables 1 and 2). It is minimum in Ancient Greek, at about 0.3, while it reaches its theoretical maximum value (1) for Japanese and Romanian with Prague dependencies. The second smallest proportion of planar sentences is achieved by Latin, at about 0.5, followed by German and Dutch, slightly below 0.7. Our findings are consistent with a previous report that about 30% of sentences in German and Dutch are not planar (Table 1 of [24]).
Secondly, one can look at the behavior of the actual number of crossings. tends to increase as increases over all treebanks (Fig 5). Interestingly, the plots in double logarithmic scale reveal the presence of a breakpoint at that separates an initial regime of fast growth of from a second regime of slower growth (Fig 5). Hereafter, we will use over a tree measure to indicate a mean over the whole ensemble of sentences of a treebank included in our analysis. Although the proportion of planar sentences can be very low when putting all tree sizes together, the number of crossings is apparently small: does not reach in any of the treebanks (Tables 1 and 2). is above 1 in only three languages: Ancient Greek, Latin and Dutch for Stanford dependencies; and only Ancient Greek and Latin for Prague dependencies. These average numbers of observed crossings are really small when compared against the average potential number of crossings of a linear tree of the same size () or the average potential number of crossings of the same tree (). In order of magnitude, the difference between and is small, suggesting that real trees are close to linear trees, namely, their hubiness is low.
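A deeper evaluation of the scarcity of crossing dependencies can be made with the help of ratios between $C_{true}$, the observed number of crossings, and the different baselines: $C_{true}/|Q|$ (the potential number of crossings of the same tree), $C_{true}/|Q|_{linear}$ (that of a linear tree of the same size), $C_{true}/E_{URLA}[C]$ (the expectation in a URLA of the same tree) and $C_{true}/E_{URLT}[C \mid \text{not star}]$ (the expectation in a URLT excluding star trees). $C_{true}/|Q|$ has already been used in research on crossings in random trees [26]. Bear in mind that
- All these ratios are positive but only $C_{true}/|Q|$ and $C_{true}/|Q|_{linear}$ are bounded above by 1.
- Each ratio defined on random baselines is proportional or approximately proportional to a deterministic baseline. On the one hand,

$$\frac{C_{true}}{E_{URLA}[C]} = 3 \frac{C_{true}}{|Q|} \qquad (35)$$

thanks to Eq. 22. On the other hand,

$$\frac{C_{true}}{E_{URLT}[C \mid \text{not star}]} \approx \frac{C_{true}}{E_{URLT}[C]} \qquad (36)$$

$$\approx 3 \frac{C_{true}}{|Q|_{linear}} \qquad (37)$$

for sufficiently large $n$.
- Although

$$\frac{C_{true}}{|Q|_{linear}} \le \frac{C_{true}}{|Q|} \qquad (38)$$

thanks to $|Q| \le |Q|_{linear}$, the relationship between

$$\frac{C_{true}}{E_{URLA}[C]} \qquad (39)$$

and

$$\frac{C_{true}}{E_{URLT}[C \mid \text{not star}]} \qquad (40)$$

is uncertain.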
$C_{true}/|Q|$ and $C_{true}/|Q|_{linear}$ tend to decrease as tree size increases (Fig. 6) and the same is expected to happen to the ratios defined on their corresponding random baselines thanks to the proportionality relationships above (Eqs. 35 and 37). Therefore, the evidence of the scarcity of crossings increases as tree size increases.
The ratios in Tables 3 and 4 show that, on average, the actual number of crossings is smaller than that of the baseline for all treebanks and for all baselines: all average ratios are below 0.3. These ratios allow one to analyze with more detail the difference in magnitude between and the different baselines (Tables 3 and 4):
- indicates that, on average, is at least 10 times smaller than and across languages. The smallest differences are achieved by Ancient Greek, where and . The relative number of crossings with respect to the same tree, i.e., , is expected to be about in a random linear arrangement of vertices [26] but indeed it is much smaller.
- as expected but the difference between and is small, suggesting that real trees are closer to linear trees than to star trees.
- indicates that, on average, is at least 10 times smaller than across treebanks except for Ancient Greek and Latin. For Ancient Greek, on average with Stanford dependencies and with Prague dependencies. For Latin, on average with Stanford dependencies and with Prague dependencies.
- indicates that, on average, is at least 10 times smaller than across treebanks except for Ancient Greek and Latin. For Ancient Greek, on average with both Stanford and Prague dependencies. For Latin, with Stanford dependencies and with Prague dependencies.
- The difference between and is small. The condition holds for all treebanks with Stanford annotations, as well as for all treebanks with Prague annotations except for Japanese and Persian.
The significance of the gap that separates the actual number of crossings and the predictions of random baselines must be evaluated statistically. Indeed, and are smaller than expected by URLTs and URLAs: the Monte Carlo test described in Section 3 yields left- for all the treebanks and both random baselines.
Fig. 7 shows that the hubiness of trees tends to decrease as increases. Tables 5 and 6 also show that never exceeds and is across treebanks, suggesting that real trees are closer to linear trees than to star trees. The similarity between linear trees and real trees supports the little difference reported above between and . Indeed, recall the alternative definition of in Eq. 20. Concerning URLTs, Fig. 7 shows that the average hubiness of real sentences tends to be above the average hubiness that is expected in a URLT over the ensemble of treebanks. A detailed analysis reveals that the average hubiness of real sentences is above the average hubiness that is expected in a URLT for all treebanks with Stanford dependencies (Table 5). However, this does not hold for the Arabic, Japanese and Persian treebank with Prague dependencies (Table 6) but the difference is small. The systematic deviation between and suggests that the hubiness of real dependency trees cannot be explained by sampling of URLTs, especially for Stanford dependencies. The gap between URLTs and real syntactic dependency trees is smaller for Prague dependencies, as Fig. 7 suggests. Notice that is about twice with Stanford dependencies whereas is about 1.4 times with Prague dependencies. The Monte Carlo tests indicate that is significantly large in all treebanks with Stanford annotations (right-). The results are less homogeneous for Prague annotations: is significantly small in Arabic, Japanese and Persian (left-) but significantly large for the remainder (right- in all cases except right- for Portuguese).
5 Discussion
We have clarified the issue of the scarcity of crossing dependencies. We have provided the first evidence that the actual number of crossings is significantly small. From the perspective of planarity, the proportion of non-planar sentences can be "high" in certain languages (e.g., Dutch) but still significantly low. On the other hand, the mean number of crossings per sentence is small, consistent with the claim that crossings in real sentences are scarce [19, 14, 12, 20, 21, 23], even in languages where non-planar sentences abound. However, whether a number is small or large is a matter of the scale or the units of measurement [43]. Therefore, statistical testing and a theory of crossings (Section 2) are vital. The former shows that the number of crossings is significantly low. The latter helps to understand why and how.
The low number of crossings of real sentences could be trivially explained by a high hubiness, which would immediately lead to a low value of , the potential number of crossings. Fig. 7 indicates that this is unlikely to be the case for sufficiently large trees: the hubiness of trees tends to decrease as $n$ increases, and so does the relative number of crossings (Fig. 6). The contribution of hubiness to keeping the number of crossings low decreases as $n$ increases.
Furthermore, the hubiness coefficient never exceeds and is about on average, although it is significantly high with respect to URLTs in the majority of treebanks. The question is whether this number is large enough to expect a low number of crossings. Thanks to Eq. 20, the relative potential of crossings with respect to a linear tree turns out to be at least , and on average. This strongly suggests that hubiness has a secondary role in explaining the scarcity of crossing dependencies. Indeed, we have seen above that various baselines indicate that real trees are close to linear trees. We have also seen that the gap between real trees and URLTs shrinks with Prague annotations. The statistical similarity between real dependency trees and linear trees is what makes the low number of crossing dependencies remarkable: linear trees maximize the potential number of crossings, as we have shown above.
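The link between hubiness and the potential number of crossings can be illustrated with a short sketch. It assumes that the hubiness coefficient is the sum of squared degrees normalized between its values for a linear tree and a star tree, and it uses the counting identity that the number of pairs of edges sharing no vertex is (n(n-1) - K2)/2, where K2 is the sum of squared degrees. The function names are hypothetical, and whether Eq. 20 has exactly this form is an assumption.

```python
from fractions import Fraction


def sum_squared_degrees(degrees):
    """Sum of squared vertex degrees of a tree."""
    return sum(k * k for k in degrees)


def potential_crossings(degrees):
    """Number of pairs of edges that do not share a vertex."""
    n = len(degrees)
    return (n * (n - 1) - sum_squared_degrees(degrees)) // 2


def hubiness(degrees):
    """Assumed definition: 0 for a linear tree, 1 for a star tree (n >= 4)."""
    n = len(degrees)
    k2_linear = 4 * n - 6      # two leaves and n - 2 vertices of degree 2
    k2_star = n * (n - 1)      # one hub of degree n - 1 and n - 1 leaves
    return Fraction(sum_squared_degrees(degrees) - k2_linear,
                    k2_star - k2_linear)


# Degree sequences of three trees with n = 6 vertices.
linear = [1, 2, 2, 2, 2, 1]
quasi_star = [4, 2, 1, 1, 1, 1]
star = [5, 1, 1, 1, 1, 1]

for degrees in (linear, quasi_star, star):
    relative = Fraction(potential_crossings(degrees),
                        potential_crossings(linear))
    # Under these definitions, the relative potential equals 1 - hubiness.
    print(hubiness(degrees), relative)
```

Under these assumptions, a low hubiness coefficient translates directly into a potential number of crossings that is close to that of a linear tree.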
The challenge for future research is to determine the true reason for the low number of crossings in sentences. A long-standing hypothesis is that the low number of crossings of real sentences is a side effect of the principle of dependency length minimization, namely, the minimization of the distance between linked vertices in the linear sequence [20, 38, 26, 44]. The low hubiness of real sentences suggests that hubiness may have a secondary role in reducing crossing dependencies. We hope that our quantification of the number of crossing dependencies with respect to baselines stimulates further research on the actual origin of their scarcity and the weight of different factors.
We have observed a breakpoint in the decay of the average number of crossings across treebanks at (Fig. 5) that is also suggested by the decay of the average relative number of crossings (Fig. 6). We suspect that it could be related to increasing pressure for dependency length minimization for longer sentences. However, the real nature of the breakpoint should be investigated further.
Although the conclusion that crossings in sentences are really scarce does not depend on the annotation format, our analyses indicate that Stanford and Prague dependencies are not statistically equivalent. For instance, we have seen that real trees are closer to URLTs with respect to hubiness when Prague dependencies are considered. This is in line with recent results highlighting various other relevant quantifiable differences between annotation criteria, e.g. in their suitability for automatic parsing [45, 46] or in the prevalence of certain patterns of crossing dependencies [47]. Thus, considering more than one annotation format is useful to analyze underlying properties of syntax, and distinguish them from properties of a specific annotation.
It is worth bearing in mind that syntactic annotation schemes are typically designed based on linguistic considerations [48], as well as technical considerations to facilitate the work of parsers and other language processing systems [49], independently from statistical considerations [50]. Our findings suggest that statistical implications should be involved when improving current annotation formats or developing new ones. Identifying the most appropriate statistical ensemble for syntactic dependency trees is an important problem that should be the subject of future research.
6 Conclusion
We have shown, with the help of different baselines, that crossings in real sentences are really scarce. Although that scarcity could be easily explained by a high hubiness, the hubiness of real sentences is rather low, suggesting that it has a secondary role in the low number of crossings of real sentences. Statistically, syntactic dependency trees seem to be closer to linear trees than to star trees. Our findings provide support for the hypothesis that dependency length minimization is the main force responsible for the scarcity of crossing dependencies.
Appendix A The maximum number of crossings of a linear tree
Figure 8 shows arrangements with the maximum number of crossings for a series of linear trees of $n$ nodes, with . Each tree of $n$ nodes is obtained by adding the vertex $n$ to the tree of $n-1$ nodes. In all cases, the linear ordering of the vertices consists of the odd vertex labels in increasing order, followed by the even vertex labels also in increasing order. We will show that this kind of arrangement achieves the maximum possible number of crossings for linear trees of $n$ nodes. Formally, these orderings can be defined as the sequence of vertices
[TABLE]
Let be the corresponding number of crossings. Notice that for [27].
In Figure 8, we adopt the convention that the edge is always red, is always blue, is always green and is always brown for all linear trees. Thus, it is easy to check the contribution to of the edge with respect to ; when , the edge adds one crossing; when , the edge adds two crossings; when , the edge adds three crossings and, when , the edge adds four crossings. We now turn to the proof.
We aim to show that (Eq. 8) for . First, , setting the base case. Second, we aim to show that with for . Suppose that a tree of vertices becomes a tree of vertices adding vertex and the edge . If is odd, the edge crosses any two edges formed with node , namely edges and , for even and . This yields . Note that cannot cross as they share vertex . If is even, then crosses and any two edges formed with node such that is odd and , giving again . Therefore,
[TABLE]
and finally (Eq. 8), as we wanted to prove.
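The claim can be checked numerically for small trees. The sketch below builds the odd-then-even ordering, counts its crossings, and compares the result with a brute-force maximum over all orderings and with the number of pairs of edges sharing no vertex in a linear tree, (n-2)(n-3)/2. The function names are hypothetical, and the exhaustive search is only feasible for small n.

```python
from itertools import combinations, permutations


def linear_tree_edges(n):
    """Edges of the linear tree 1 - 2 - ... - n."""
    return [(i, i + 1) for i in range(1, n)]


def odd_even_arrangement(n):
    """Odd labels in increasing order, then even labels in increasing order."""
    return list(range(1, n + 1, 2)) + list(range(2, n + 1, 2))


def count_crossings(edges, order):
    """Count pairs of edges that cross in the given linear ordering."""
    position = {v: p for p, v in enumerate(order)}
    total = 0
    for (a, b), (c, d) in combinations(edges, 2):
        lo1, hi1 = sorted((position[a], position[b]))
        lo2, hi2 = sorted((position[c], position[d]))
        # Strict inequalities exclude edges that share a vertex.
        if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
            total += 1
    return total


def check(n):
    """Return (crossings of odd-even ordering, brute-force maximum, bound)."""
    edges = linear_tree_edges(n)
    achieved = count_crossings(edges, odd_even_arrangement(n))
    best = max(count_crossings(edges, order)
               for order in permutations(range(1, n + 1)))
    bound = (n - 2) * (n - 3) // 2
    return achieved, best, bound


for n in range(3, 8):
    print(n, check(n))
```

For each n in this range the three numbers coincide, in agreement with the recurrence in which the new edge adds one crossing more than the previous one.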
Appendix B Expectations on uniformly random labelled trees excluding star trees
There are $n$ labelled star trees: each can be constructed by choosing one of the $n$ vertices as the hub. Since there are $n^{n-2}$ labelled trees in total [51], the probability that a URLT is a star tree is
[TABLE]
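This probability can be verified by exhaustive enumeration for small trees. The sketch below relies on Prüfer sequences, a standard bijection between labelled trees of n vertices and sequences of length n - 2 over the n labels, in which the degree of a vertex is one plus its number of occurrences. The function names are hypothetical and vertices are labelled from 0 for convenience.

```python
from fractions import Fraction
from itertools import product


def degrees_from_pruefer(sequence, n):
    """Degree of each vertex of the labelled tree encoded by the sequence."""
    degrees = [1] * n
    for v in sequence:
        degrees[v] += 1
    return degrees


def star_tree_probability(n):
    """Exact probability that a uniformly random labelled tree is a star."""
    total = 0
    stars = 0
    for sequence in product(range(n), repeat=n - 2):
        total += 1
        # A star tree has a hub of degree n - 1.
        if max(degrees_from_pruefer(sequence, n)) == n - 1:
            stars += 1
    return Fraction(stars, total)


for n in range(3, 7):
    # Compare the enumeration with n / n^(n-2).
    print(n, star_tree_probability(n), Fraction(n, n ** (n - 2)))
```

The enumeration grows as n^(n-2), so it is only meant as a sanity check for small n.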
We define the sum of squared degrees of a tree as [25]
[TABLE]
and define as the probability that a URLT has as sum of squared degrees knowing that it is not a star tree. We have that
[TABLE]
We have seen above that the maximum value of for a given is achieved by a star tree (Eq. 10), and hence the same can be said about the maximum value of . If we call this value , then
[TABLE]
Therefore, for , we can apply and to obtain
[TABLE]
If star trees are excluded, the maximum hubiness is reached by a quasi-star tree, a tree that gives the second largest value of , and is defined by one vertex of degree , one vertex of degree 2 and the remainder of vertices of degree 1 [38] (Fig. 2). Suppose that and are the values of of a linear tree and a quasi-star tree, respectively. The expectation of of a URLT knowing that it is not a star tree is
[TABLE]
Knowing that
[TABLE]
[TABLE]
thanks to Eq. 23, and recalling Eq. 43, one obtains
[TABLE]
Notice that
[TABLE]
for sufficiently large (compare Eqs. 23 and 51).
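The effect of excluding star trees on the expected sum of squared degrees can be checked by enumeration for small n. The sketch below enumerates all labelled trees through their Prüfer sequences (an assumption of this illustration, not a method stated in the text) and compares the mean over non-star trees with the value obtained by removing the star-tree contribution from the unconditional mean. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def k2_from_pruefer(sequence, n):
    """Sum of squared degrees of the tree encoded by a Pruefer sequence."""
    degrees = [1] * n
    for v in sequence:
        degrees[v] += 1
    return sum(k * k for k in degrees)


def k2_expectations(n):
    """Return (mean over all trees, mean over non-star trees, P(star))."""
    k2_star = n * (n - 1)  # maximum value, reached only by star trees
    values = [k2_from_pruefer(s, n)
              for s in product(range(n), repeat=n - 2)]
    non_star = [v for v in values if v != k2_star]
    mean_all = Fraction(sum(values), len(values))
    mean_non_star = Fraction(sum(non_star), len(non_star))  # needs n >= 4
    p_star = Fraction(len(values) - len(non_star), len(values))
    return mean_all, mean_non_star, p_star


for n in range(4, 7):
    mean_all, mean_non_star, p_star = k2_expectations(n)
    # Conditional expectation obtained by subtracting the star-tree term.
    corrected = (mean_all - p_star * n * (n - 1)) / (1 - p_star)
    print(n, mean_all, mean_non_star, corrected)
```

For n = 4 the only non-star trees are linear trees, so the conditional mean equals the sum of squared degrees of a linear tree of four vertices.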
Adapting Eq. 24 to , one obtains
[TABLE]
Plugging Eq. 51 into Eq. 53, one obtains
[TABLE]
and also
[TABLE]
Adapting Eq. 28 to , one obtains
[TABLE]
Note that
[TABLE]
and then
[TABLE]
It is easy to see that
[TABLE]
for sufficiently large . For numerical reasons, it is convenient to use Eq. 58 up to and then replace the formula simply by . can be chosen as the largest value of for which Eq. 58 does not produce numerical overflows when calculating the powers. Such a critical value increases through the decomposition
[TABLE]
with
[TABLE]
All the corrected expectations that we have calculated in this section require $n \geq 4$ because there are no labelled trees with $n < 4$ that are not star trees.
Acknowledgements
RFC is funded by the grants 2014SGR 890 (MACDA) from AGAUR (Generalitat de Catalunya) and also the APCOM project (TIN2014-57226-P) from MINECO. CGR has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 714150 - FASTPARSE) and from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) from MINECO. JLE is funded by the project TASSAT3 (TIN2016-76573-C2-1-P) from MINECO (Ministerio de Economía y Competitividad).
- [1] R. Cohen and S. Havlin. Complex networks. Structure, robustness and function. Cambridge University Press, Cambridge, UK, 2010.
- [2] M. E. J. Newman. Networks. An introduction. Oxford University Press, Oxford, 2010.
- [3] M. E. J. Newman. The structure and function of complex networks. SIAM Review, pages 167–256, 2003.
- [4] E. A. Bender and E. R. Canfield. The asymptotic number of labeled graphs with given degree sequences. J. Combin. Theory Ser. A, 24, 1978.
- [5] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6:161–180, 1995.
- [6] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distribution and their applications. Phys. Rev. E, 64:026118, 2001.
- [7] R. S. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298:824–827, 2002.
- [8] R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, and U. Alon. On the uniform generation of random graphs with prescribed degree sequences. arXiv preprint cond-mat/0312028, 2003.
