Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation
Masashi Yoshikawa, Hiroshi Noji, Koji Mineshima, Daisuke Bekki

TL;DR
This paper introduces a domain adaptation method for CCG parsing that automatically generates CCG corpora from dependency trees, significantly improving parser performance across diverse domains.
Contribution
The paper presents a simple, parser-architecture-independent method for domain adaptation by automatically generating CCG data from dependency resources.
Findings
Significant performance improvements on speech conversation and math problem datasets.
Effective domain adaptation demonstrated across four different datasets.
Method is compatible with current top-performing CCG parsers.
Abstract
We propose a new domain adaptation method for Combinatory Categorial Grammar (CCG) parsing, based on the idea of automatically generating CCG corpora by exploiting cheaper resources of dependency trees. Our solution is conceptually simple and does not rely on a specific parser architecture, making it applicable to the current best-performing parsers. We conduct extensive parsing experiments with detailed discussion; on top of existing benchmark datasets on (1) biomedical texts and (2) question sentences, we create experimental datasets of (3) speech conversation and (4) math problems. When the proposed method is applied, an off-the-shelf CCG parser shows significant performance gains, improving from 90.7% to 96.6% on speech conversation, and from 88.5% to 96.8% on math problems.
| Method | UF1 | LF1 |
|---|---|---|
| depccg | 94.0 | 88.8 |
| + ELMo | 94.98 | 90.51 |
| Converter | 96.48 | 92.68 |

| Relation | Parser | Converter | # |
|---|---|---|---|
| (a) PP attaching to NP | 90.62 | 97.46 | 2,561 |
| (a) PP attaching to VP | 81.15 | 88.63 | 1,074 |
| (b) Subject relative clause | 93.44 | 98.71 | 307 |
| (b) Object relative clause | 90.48 | 93.02 | 20 |

| Method | P | R | F1 |
|---|---|---|---|
| C&C | 77.8 | 71.4 | 74.5 |
| EasySRL | 81.8 | 82.6 | 82.2 |
| depccg | 83.11 | 82.63 | 82.87 |
| + ELMo | 85.87 | 85.34 | 85.61 |
| + GENIA1000 | 85.45 | 84.49 | 84.97 |
| + Proposed | 86.90 | 86.14 | 86.52 |

| Method | P | R | F1 |
|---|---|---|---|
| C&C | - | - | 86.8 |
| EasySRL | 88.2 | 87.9 | 88.0 |
| depccg | 90.42 | 90.15 | 90.29 |
| + ELMo | 90.55 | 89.86 | 90.21 |
| + Proposed | 90.27 | 89.97 | 90.12 |

| | Sentence |
|---|---|
| a. | we should cause it does help |
| b. | the only problem i see with term limitations is that i think that the bureaucracy in our government as is with most governments is just so complex that there is a learning curve and that you ca n’t just send someone off to washington and expect his first day to be an effective congress precision |

| Error type | # |
|---|---|
| PP-attachment | 3 |
| Adverbs attaching wrong place | 11 |
| Predicate-argument | 5 |
| Imperative | 2 |
| Informal functional words | 2 |
| Others | 11 |

| Method | Whole P | Whole R | Whole F1 | Subset UF1 | Subset LF1 |
|---|---|---|---|---|---|
| depccg | 74.73 | 73.91 | 74.32 | 90.68 | 82.46 |
| + ELMo | 75.76 | 76.62 | 76.19 | 93.23 | 86.46 |
| + Proposed | 78.03 | 77.06 | 77.54 | 95.63 | 92.65 |

| Method | UF1 | LF1 |
|---|---|---|
| depccg | 88.49 | 66.15 |
| + ELMo | 89.32 | 70.74 |
| + Proposed | 95.83 | 80.53 |
Masashi Yoshikawa¹ (yoshikawa.masashi.yh8@is.naist.jp), Hiroshi Noji², Koji Mineshima³, Daisuke Bekki³
¹ Nara Institute of Science and Technology, Nara, Japan
² Artificial Intelligence Research Center, AIST, Tokyo, Japan
³ Ochanomizu University, Tokyo, Japan
1 Introduction
Recent advances in Combinatory Categorial Grammar (CCG; Steedman, 2000) parsing (Lee et al., 2016; Yoshikawa et al., 2017), combined with formal semantics, have enabled high-performing natural language inference systems (Abzianidze, 2017; Martínez-Gómez et al., 2017). We are interested in transferring this success to a range of applications, e.g., inference systems for scientific papers and speech conversation.
To achieve this goal, it is urgent to improve CCG parsing accuracy on new domains, i.e., to solve the notorious problem of domain adaptation of a statistical parser, which has long been addressed in the literature. In CCG parsing in particular, prior work (Rimell and Clark, 2008; Lewis et al., 2016) has taken advantage of highly informative categories, which determine most of the sentence structure once correctly assigned to words. It has been demonstrated that annotating only pre-terminal categories is sufficient to adapt a CCG parser to new domains. However, this solution is limited to a specific parser architecture, which makes it non-trivial to apply to the current state-of-the-art parsers (Lee et al., 2016; Yoshikawa et al., 2017; Stanojević and Steedman, 2019), which require full parse annotation. Additionally, some ambiguities remain unresolved with supertags alone, especially in languages other than English (as discussed in Yoshikawa et al. (2017)), to which the method is not portable.
Distributional embeddings have proven to be powerful tools for domain adaptation across NLP, including syntactic parsing (Lewis and Steedman, 2014b; Mitchell and Steedman, 2015; Peters et al., 2018). Among others, Joshi et al. (2018) report large performance boosts in constituency parsing using contextualized word embeddings (Peters et al., 2018). That technique is orthogonal to our work, and combining the two yields large gains. There are also studies, including Joshi et al. (2018), on learning from partially annotated trees (Mirroshandel and Nasr, 2011; Li et al., 2016; Joshi et al., 2018); again, most of them exploit a specific parser architecture.
In this work, we propose a conceptually simpler approach to the issue that is agnostic to parser architecture: automatic generation of CCGbanks (i.e., CCG treebanks) for new domains, exploiting cheaper resources of dependency trees. (In this paper, we call a treebank based on CCG grammar a CCGbank, and refer to the specific one constructed in Hockenmaier and Steedman (2007) as the English CCGbank.) Specifically, we train a deep conversion model to map a dependency tree to a CCG tree, on aligned annotations of the Penn Treebank (Marcus et al., 1993) and the English CCGbank (Hockenmaier and Steedman, 2007) (Figure 1a). When we need a CCG parser tailored for a new domain, the trained converter is applied to a dependency corpus in that domain to obtain a new CCGbank (Figure 1b), which is then used to fine-tune an off-the-shelf CCG parser (Figure 1c). The assumption that a dependency corpus exists in the target domain is not demanding, given the abundance of existing dependency resources with well-developed annotation procedures, e.g., the Universal Dependencies (UD) project (Nivre et al., 2016), and the lower cost of training an annotator.
One of the biggest bottlenecks of syntactic parsing is the handling of countless unknown words. Our converter likewise faces unfamiliar input types, e.g., disfluencies in speech and symbols in math problems. We address these issues by constrained decoding (§4), enabled by incorporating a parsing technique into our converter. Nevertheless, syntactic structures exhibit less variance across textual domains than words do; our proposed converter therefore suffers less from such unseen events, and is expected to produce high-quality CCGbanks.
The work closest to ours is Jiang et al. (2018), where a conversion model is trained to map between dependency treebanks of different annotation principles and is used to increase the amount of labeled data in the target-side treebank. Our work extends theirs and solves a more challenging task: the mapping to learn is to more complex CCG trees, and it is applied to datasets of plainly different natures (i.e., domains). Some prior studies design conversion algorithms to induce CCGbanks for languages other than English from dependency treebanks (Bos et al., 2009; Ambati et al., 2013). Though these methods may be applied to our problem, they usually cannot cover the entire dataset, and consequently discard sentences with characteristic features. On top of that, unavoidable information gaps between the two syntactic formalisms can at best be addressed probabilistically.
To verify the generalizability of our approach, on top of the existing benchmarks on (1) biomedical texts and (2) question sentences (Rimell and Clark, 2008), we conduct parsing experiments on (3) speech conversation texts, which pose other challenges such as informal expressions and lengthy sentences. We create a CCG version of the Switchboard corpus (Godfrey et al., 1992), consisting of full train/dev/test sets of automatically generated trees and 100 manually annotated sentences for a detailed evaluation. Additionally, we manually construct experimental data for parsing (4) math problems (Seo et al., 2015), for which the importance of domain adaptation has previously been demonstrated (Joshi et al., 2018). We observe large additive gains in the performance of the depccg parser (Yoshikawa et al., 2017) by combining contextualized word embeddings (Peters et al., 2018) and our domain adaptation method: in terms of unlabeled F1 scores, from 90.68% to 95.63% on speech conversation, and from 88.49% to 95.83% on math problems. All the programs and resources used in this work are available at: https://github.com/masashi-y/depccg.
2 Combinatory Categorial Grammar
CCG is a lexicalized grammatical formalism, where words and phrases are assigned categories with complex internal structures. A category $X/Y$ (or $X\backslash Y$) represents a phrase that combines with a $Y$ phrase on its right (or left), and becomes an $X$ phrase. As such, the category $(S\backslash NP)/NP$ represents an English transitive verb, which takes $NP$s on both sides and becomes a sentence ($S$).
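To make the slash notation concrete, the following minimal Python sketch applies forward and backward application to a transitive verb. The nested-tuple encoding of categories and the helper names `forward_apply` and `backward_apply` are illustrative assumptions, not part of depccg or of the paper.

```python
from typing import Optional, Tuple, Union

# Assumed encoding: a category is an atomic string ("S", "NP") or a triple
# (result, slash, argument), e.g. ("S", "\\", "NP") stands for S\NP.
Category = Union[str, Tuple["Category", str, "Category"]]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """X/Y followed by Y yields X; returns None if the rule does not apply."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Y followed by X\\Y yields X; returns None if the rule does not apply."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


# Transitive verb category (S\NP)/NP.
TRANSITIVE: Category = (("S", "\\", "NP"), "/", "NP")

verb_phrase = forward_apply(TRANSITIVE, "NP")   # verb + object NP
sentence = backward_apply("NP", verb_phrase)    # subject NP + verb phrase
```

The verb first consumes the object on its right, and the resulting verb phrase then consumes the subject on its left.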
The semantic structure of a sentence can be extracted using the functional nature of CCG categories. Figure 2 shows an example CCG derivation of the phrase cats that Kyle wants to see, where categories are marked with variables and constants, and with argument ids (subscripts) in the case of verbs. Unification is performed on these variables and constants in the course of the derivation, resulting in chains of equations that successfully recover the first and second arguments of see: Kyle and cats (i.e., capturing long-range dependencies). What is demonstrated here is what the standard evaluation of CCG parsing measures: the number of such correctly predicted predicate-argument relations (for details, see Clark et al. (2002)). Remarkably, it is also the basis of CCG-based semantic parsing (Abzianidze, 2017; Martínez-Gómez et al., 2017; Matsuzaki et al., 2017), where the above simple unification rule is replaced with more sophisticated techniques such as λ-calculus.
There are two major resources in CCG: the English CCGbank (Hockenmaier and Steedman, 2007) for news texts, and the Groningen Meaning Bank (Bos et al., 2017) for wider domains, including Aesop’s fables. However, when one wants a CCG parser tuned for a specific domain, one faces the issue of its high annotation cost:
- The annotation requires linguistic expertise, i.e., the ability to keep track of the semantic composition performed during a derivation.
- An annotated tree must strictly conform to the grammar; e.g., combining two categories that no combinatory rule licenses results in an ill-formed tree and hence must be disallowed.
We relax these assumptions by using dependency trees, which are a simpler representation of syntactic structure: they lack information on long-range dependencies and on the conjunct spans of coordination structures. However, due to their simplicity and flexibility, it is easier to train an annotator, and plenty of accessible dependency-based resources exist, which we exploit in this work.
3 Dependency-to-CCG Converter
We propose a domain adaptation method based on the automatic generation of a CCGbank out of a dependency treebank in the target domain. This is achieved by our dependency-to-CCG converter, a neural network model consisting of a dependency tree encoder and a CCG tree decoder.
In the encoder, higher-order interactions among dependency edges are modeled with a bidirectional TreeLSTM (Miwa and Bansal, 2016), which is important to facilitate mapping from a dependency tree to a more complex CCG tree. Due to the strict nature of CCG grammar, we model the output space of CCG trees explicitly; our decoder is inspired by the recent success of A* CCG parsing (Lewis and Steedman, 2014a; Yoshikawa et al., 2017), where the most probable valid tree is found using A* parsing (Klein and Manning, 2003). (The strictness and the large number of categories still make it hard to leave everything to neural networks to learn: we trained the constituency-based RSP parser (Joshi et al., 2018) on the English CCGbank by disguising the trees as constituency ones, but its performance could not be evaluated since most of the output trees violated the grammar.) In the following, we describe the details of the proposed converter.
Firstly, we define a probabilistic model of the dependency-to-CCG conversion process. According to Yoshikawa et al. (2017), the structure of a CCG tree $y$ for sentence $x = (x_1, \dots, x_N)$ is almost uniquely determined (the uniqueness is broken if a tree contains a unary node) if a sequence of the pre-terminal CCG categories (supertags) $c = (c_1, \dots, c_N)$ and a dependency structure $d = (d_1, \dots, d_N)$, where $d_i \in \{0, \dots, N\}$ is the index of the dependency parent of $x_i$ (0 represents a root node), are provided. Note that the dependency structure $d$ is generally different from an input dependency tree. In this work, the input dependency tree is based on Universal Dependencies (Nivre et al., 2016), while the dependency structure of a CCG tree is the Head First dependency tree introduced in Yoshikawa et al. (2017); see §5 for details.
While supertags are highly informative about the syntactic structure (Bangalore and Joshi, 1999), remaining ambiguities such as attachment ambiguities need to be modeled using dependencies. Let the input dependency tree of sentence $x$ be $z = (p, d', \ell)$, where $p_i$ is the part-of-speech tag of $x_i$, $d'_i$ the index of its dependency parent, and $\ell_i$ the label of the corresponding dependency edge. Assuming that each $c_i$ and each $d_i$ is independent, the conversion process is expressed as follows:
$$P(y \mid x, z) = \prod_{i=1}^{N} p_{tag}(c_i \mid x, z) \prod_{i=1}^{N} p_{dep}(d_i \mid x, z)$$
Based on this formulation, we model $c_i$ and $d_i$ conditioned on a dependency tree $z$, and search for the $y$ that maximizes $P(y \mid x, z)$ using A* parsing.
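Under the independence assumption stated here, the score of a candidate tree is a product of per-word supertag and head probabilities, or equivalently a sum of their logarithms. The sketch below illustrates that computation in Python; the function name `tree_log_prob`, the list-of-dict input layout, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def tree_log_prob(
    tag_probs: List[Dict[str, float]],   # tag_probs[i][c]: probability of supertag c for word i
    dep_probs: List[List[float]],        # dep_probs[i][j]: probability that word i has head j (0 = root)
    supertags: List[str],                # chosen supertag for each word
    heads: List[int],                    # chosen head index for each word (1-based, 0 = root)
) -> float:
    """Sum of log supertag and log head probabilities over all words."""
    total = 0.0
    for i, (cat, head) in enumerate(zip(supertags, heads)):
        total += math.log(tag_probs[i][cat])
        total += math.log(dep_probs[i][head])
    return total


# Toy two-word sentence "cats sleep": "sleep" is the root and heads "cats".
toy_tag_probs = [{"NP": 0.9, "N": 0.1}, {"S\\NP": 0.8, "NP": 0.2}]
toy_dep_probs = [[0.1, 0.0, 0.9], [0.7, 0.3, 0.0]]
log_p = tree_log_prob(toy_tag_probs, toy_dep_probs, ["NP", "S\\NP"], [2, 0])
```

Working in log space turns the product into a sum, which is also the form a chart-based search such as A* parsing operates on.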
Encoder
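A bidirectional TreeLSTM consists of two distinct TreeLSTMs (Tai et al., 2015). A bottom-up TreeLSTM recursively computes a hidden vector $h^{\uparrow}_i$ for each word $x_i$, from the vector representation $e_i$ of the word and the hidden vectors of its dependency children $\{h^{\uparrow}_j \mid d'_j = i\}$, where $d'_j$ is the index of the parent of $x_j$ in the input dependency tree. A top-down TreeLSTM, in turn, computes $h^{\downarrow}_i$ using $e_i$ and the hidden vector of the dependency parent, $h^{\downarrow}_{d'_i}$. In total, a bidirectional TreeLSTM returns the concatenations of the hidden vectors for all words: $h_i = h^{\uparrow}_i \oplus h^{\downarrow}_i$.
We encode a dependency tree as follows, where $e_v$ denotes the vector representation of variable $v$ (a word $x_i$, its part-of-speech tag $p_i$, or its dependency label $\ell_i$), and $\Omega$ and $\Xi_{d'}$ are shorthand notations for the series of operations of sequential and tree bidirectional LSTMs, respectively:
$$e_1, \dots, e_N = \Omega(e_{p_1} \oplus e_{x_1}, \dots, e_{p_N} \oplus e_{x_N})$$
$$h_1, \dots, h_N = \Xi_{d'}(e_1 \oplus e_{\ell_1}, \dots, e_N \oplus e_{\ell_N})$$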
Decoder
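The decoder adopts the same architecture as in Yoshikawa et al. (2017), where the head probabilities $p_{dep}$ and the supertag probabilities $p_{tag}$ are computed on top of the encoder outputs $\{h_i\}$, using a biaffine layer (Dozat and Manning, 2017) and a bilinear layer, respectively, and are then used in A* parsing to find the most probable CCG tree.
Firstly, a biaffine layer is used to compute unigram head probabilities $p_{dep}$ as follows:
$$r_i = \psi^{dep}_{child}(h_i), \quad r_j = \psi^{dep}_{head}(h_j),$$
$$s_{i,j} = r_i^{\top} W r_j + w^{\top} r_j,$$
$$p_{dep}(d_i = j \mid x, z) \propto \exp(s_{i,j}),$$
where $\psi$ denotes a multi-layer perceptron. The probabilities $p_{tag}$ are computed by a bilinear transformation of the vector encodings of $x_i$ and $x_{\hat{d}_i}$, where $\hat{d}_i$ is the most probable dependency head of $x_i$ with respect to $p_{dep}$: $\hat{d}_i = \arg\max_j p_{dep}(d_i = j \mid x, z)$.
$$q_i = \psi^{tag}_{child}(h_i), \quad q_{\hat{d}_i} = \psi^{tag}_{head}(h_{\hat{d}_i}),$$
$$s_{i,c} = q_i^{\top} W_c\, q_{\hat{d}_i} + v_c^{\top} q_i + u_c^{\top} q_{\hat{d}_i} + b_c,$$
$$p_{tag}(c_i = c \mid x, z) \propto \exp(s_{i,c})$$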
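As an illustration of the head-selection step, the sketch below scores candidate heads with a biaffine form in the style of Dozat and Manning (2017), normalizes the scores with a softmax, and picks the most probable head. It is a simplified sketch in pure Python: the MLP outputs are assumed to be given as plain lists, and the weights `W`, `w` and all vectors are made-up toy values rather than anything taken from the paper or from depccg.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_score(r_child: Vector, r_head: Vector, W: Matrix, w: Vector) -> float:
    """Score r_child^T W r_head + w^T r_head for one (child, head) pair."""
    bilinear = sum(
        r_child[a] * W[a][b] * r_head[b]
        for a in range(len(r_child))
        for b in range(len(r_head))
    )
    head_bias = sum(w[b] * r_head[b] for b in range(len(r_head)))
    return bilinear + head_bias


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_distribution(r_child: Vector, r_heads: List[Vector], W: Matrix, w: Vector) -> List[float]:
    """Probability of each candidate head (index 0 plays the role of the root)."""
    return softmax([biaffine_score(r_child, r_head, W, w) for r_head in r_heads])


# Toy values: one child vector and three candidate heads (root, word 1, word 2).
toy_W = [[2.0, 0.0], [0.0, 1.0]]
toy_w = [0.0, 0.5]
toy_heads = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
probs = head_distribution([1.0, 0.0], toy_heads, toy_W, toy_w)
best_head = max(range(len(probs)), key=probs.__getitem__)
```

The index chosen here corresponds to the most probable head, which is the word whose encoding is paired with the child when scoring supertags.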
A* Parsing
Since the probability of a CCG tree decomposes into probabilities of subtrees, the problem of finding the most probable tree can be solved with a chart-based algorithm. In this work, we use one such algorithm, A* parsing (Klein and Manning, 2003). A* parsing is a generalization of A* search for the shortest path problem on a graph; it uses a priority queue to control which subtree (corresponding to a node in the graph case) to visit next. We follow Yoshikawa et al. (2017) exactly in formulating our A* parsing, and adopt an admissible heuristic by taking the sum of the max probabilities outside a subtree. The advantage of employing an A* parsing-based decoder is not limited to the optimality guarantee of the decoded tree; it also enables constrained decoding, which is described next.
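The agenda-driven search can be sketched as follows. This is a deliberately simplified illustration, not the depccg implementation: it scores trees with supertag log-probabilities only (head probabilities are omitted), supports only forward and backward application, and encodes categories as nested tuples. The names `combine` and `astar_parse` are hypothetical. The heuristic is the sum of the best supertag log-probabilities of the words outside a span, which never underestimates the achievable outside score and is therefore admissible.

```python
import heapq
import itertools
import math


def combine(left, right):
    """Forward (X/Y Y -> X) and backward (Y X\\Y -> X) application, else None."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def astar_parse(tag_log_probs, goal="S"):
    """Return (log probability, tree) of the best goal parse, or None."""
    n = len(tag_log_probs)
    # prefix[i]: sum of the best supertag scores of words 0 .. i-1.
    prefix = [0.0]
    for dist in tag_log_probs:
        prefix.append(prefix[-1] + max(dist.values()))

    def outside(i, j):
        # Upper bound on the score obtainable outside the span [i, j).
        return prefix[i] + (prefix[n] - prefix[j])

    tie = itertools.count()  # unique tie-breaker so the heap never compares trees
    agenda = []

    def push(inside, i, j, cat, tree):
        priority = -(inside + outside(i, j))  # heapq pops the smallest value first
        heapq.heappush(agenda, (priority, next(tie), inside, i, j, cat, tree))

    for i, dist in enumerate(tag_log_probs):
        for cat, logp in dist.items():
            push(logp, i, i + 1, cat, (cat, i))

    chart = {}  # (start, end, category) -> (inside score, tree)
    while agenda:
        _, _, inside, i, j, cat, tree = heapq.heappop(agenda)
        if (i, j, cat) in chart:
            continue  # a better-scored copy of this item was already expanded
        chart[(i, j, cat)] = (inside, tree)
        if (i, j, cat) == (0, n, goal):
            return inside, tree
        for (k, l, other), (o_inside, o_tree) in list(chart.items()):
            if l == i:  # chart item is the left neighbour
                parent = combine(other, cat)
                if parent is not None:
                    push(o_inside + inside, k, j, parent, (parent, o_tree, tree))
            elif k == j:  # chart item is the right neighbour
                parent = combine(cat, other)
                if parent is not None:
                    push(inside + o_inside, i, l, parent, (parent, tree, o_tree))
    return None


TV = (("S", "\\", "NP"), "/", "NP")   # transitive verb (S\NP)/NP
IV = ("S", "\\", "NP")                # intransitive verb S\NP
toy_sentence = [
    {"NP": math.log(0.9), "N": math.log(0.1)},
    {TV: math.log(0.7), IV: math.log(0.3)},
    {"NP": math.log(0.8), "N": math.log(0.2)},
]
result = astar_parse(toy_sentence)
```

Because the priority combines the inside score with an optimistic estimate of the outside score, the first goal item popped from the agenda is the highest-scoring one under this toy model.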
4 Constrained Decoding
While our method is a fully automated treebank generation method, there are often cases where we want to control the form of output trees by using external language resources. For example, when generating a CCGbank for the biomedical domain, it is convenient if a disease dictionary can be used to ensure that a complex disease name in a text is always assigned the category $NP$. In our decoder based on A* parsing, it is possible to perform such a controlled generation of a CCG tree by imposing constraints on the space of trees.
A constraint is a triplet $(c, i, j)$ representing a constituent of category $c$ spanning the words $x_i, \dots, x_j$. Constrained decoding is achieved by refusing to add a subtree (denoted likewise as $(c', k, l)$, with its category and span) to the priority queue when it meets one of the following conditions:
- The spans overlap: $i < k \le j < l$ or $k < i \le l < j$.
- The spans are identical ($i = k$ and $j = l$), while the categories are different ($c \ne c'$) and no category $c''$ exists such that $c' \Rightarrow c''$ is a valid unary rule.
The last condition on unary rules is necessary to prevent structures such as a unary $N \Rightarrow NP$ node from being accidentally discarded when using a constraint to make a noun phrase an $NP$. A set of multiple constraints is imposed by checking the above conditions for each of the constraints when adding a new item to the priority queue. When one wants to constrain the terminal category of a word $x_i$ to be $c$, that is achieved by manipulating $p_{tag}$: $p_{tag}(c_i = c \mid x, z) = 1$ and, for all categories $c' \ne c$, $p_{tag}(c_i = c' \mid x, z) = 0$.
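The two rejection conditions and the terminal-category trick can be sketched in Python as below. The function names `violates`, `allowed`, and `pin_supertag`, the inclusive span convention, and the `unary_inputs` set (categories to which some unary rule applies) are assumptions made for this illustration; in particular, the unary-rule test reflects one reading of the second condition.

```python
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (category, first word index, last word index), inclusive


def violates(item: Span, constraint: Span, unary_inputs: Set[str]) -> bool:
    """True if adding `item` to the agenda would break `constraint`."""
    c, i, j = constraint
    c2, k, l = item
    # Condition 1: the two spans cross each other.
    if (i < k <= j < l) or (k < i <= l < j):
        return True
    # Condition 2: same span, different category, and no unary rule applies to the item.
    if (i, j) == (k, l) and c2 != c and c2 not in unary_inputs:
        return True
    return False


def allowed(item: Span, constraints: Iterable[Span], unary_inputs: Set[str]) -> bool:
    """An item may enter the priority queue only if it violates no constraint."""
    return not any(violates(item, con, unary_inputs) for con in constraints)


def pin_supertag(dist: Dict[str, float], category: str) -> Dict[str, float]:
    """Force a word's supertag by giving it probability 1 and all others 0."""
    return {c: (1.0 if c == category else 0.0) for c in set(dist) | {category}}


# Toy setup: words 2..4 must form an NP; N can be rewritten by a unary rule.
toy_constraints = [("NP", 2, 4)]
toy_unary_inputs = {"N"}
pinned = pin_supertag({"N": 0.6, "NP": 0.4}, "NP")
```

With this toy setup, an `N` over words 2 to 4 is kept because a unary rule can still turn it into the required category, while a span covering words 3 to 5 is rejected because it crosses the constraint.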
5 Experiments
5.1 Experimental Settings
We evaluate our method in terms of the performance gain obtained by fine-tuning an off-the-shelf CCG parser, depccg (Yoshikawa et al., 2017), on a variety of CCGbanks obtained by converting existing dependency resources using the method.
In short, the method of depccg is equivalent to omitting the dependence on the input dependency tree from our converter model, and running an A* parsing-based decoder on the supertag and head probabilities computed from the sequentially encoded words alone, as in our method. In the plain depccg, the word representation is a concatenation of GloVe vectors (https://nlp.stanford.edu/projects/glove/) and vector representations of affixes. As in the previous work, the parser is trained on both the English CCGbank (Hockenmaier and Steedman, 2007) and the tri-training dataset by Yoshikawa et al. (2017). In this work, on top of that, we include as a baseline a setting where the affix vectors are replaced by contextualized word representations (ELMo; Peters et al., 2018), which we find marks the current best scores in English CCGbank parsing (Table 1). We used the “original” ELMo model, with 1,024-dimensional word vector outputs (https://allennlp.org/elmo).
The evaluation is based on the standard metric, which counts the correctly predicted predicate-argument relations (§2); labeled metrics take into account the category through which the dependency is constructed, while unlabeled ones do not.
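The difference between the labeled and unlabeled scores can be illustrated with a small Python sketch. The tuple layout of a dependency, the helper names `prf` and `unlabeled`, and the toy data are assumptions for illustration; the official evaluation scripts may differ in details such as how unlabeled dependencies are normalized.

```python
from typing import Set, Tuple

# Assumed layout: (head word index, argument word index, head category, argument slot).
Dep = Tuple[int, int, str, int]


def prf(gold: Set, pred: Set) -> Tuple[float, float, float]:
    """Precision, recall and F1 of predicted relations against gold relations."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def unlabeled(deps: Set[Dep]) -> Set[Tuple[int, int]]:
    """Drop the category and slot, keeping only the word pair."""
    return {(head, arg) for head, arg, _, _ in deps}


# Toy example: the second dependency links the right words via the wrong category.
gold_deps = {(1, 0, "(S\\NP)/NP", 1), (1, 2, "(S\\NP)/NP", 2)}
pred_deps = {(1, 0, "(S\\NP)/NP", 1), (1, 2, "(S\\NP)/PP", 2)}

labeled_scores = prf(gold_deps, pred_deps)
unlabeled_scores = prf(unlabeled(gold_deps), unlabeled(pred_deps))
```

In the toy example the unlabeled score is perfect while the labeled score is not, because one relation connects the right words through the wrong category.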
Implementation Details
The input word representations to the converter are the concatenation of GloVe and ELMo representations. The part-of-speech tag and dependency label embeddings are each randomly initialized 50-dimensional vectors. The two-layer sequential LSTMs output 300-dimensional vectors, as does the bidirectional TreeLSTM, whose outputs are then fed into 1-layer 100-dimensional MLPs with ELU non-linearity (Clevert et al., 2016). The training is done by minimizing the sum of the negative log likelihoods of the supertag and head predictions using the Adam optimizer, on the dataset detailed below.
Data Processing
In this work, the input tree to the converter follows Universal Dependencies (UD) v1 (Nivre et al., 2016). Constituency-based treebanks are converted to UD trees using the Stanford Converter, version 3.9.1 (https://nlp.stanford.edu/software/stanford-dependencies.shtml). The output dependency structure follows the Head First dependency tree (Yoshikawa et al., 2017), where a dependency arc is always from left to right. The conversion model is trained to map UD trees in the Wall Street Journal (WSJ) portion 2-21 of the Penn Treebank (Marcus et al., 1993) to their corresponding CCG trees in the English CCGbank (Hockenmaier and Steedman, 2007).
Fine-tuning the CCG Parser
In each of the following domain adaptation experiments, newly obtained CCGbanks are used to fine-tune the parameters of the baseline parser described above, by re-training it on the mixture of labeled examples from the new target-domain CCGbank, the English CCGbank, and the tri-training dataset.
5.2 Evaluating Converter’s Performance
First, we examine whether the trained converter can produce high-quality CCG trees, by applying it to dependency trees in the test portion (WSJ23) of the Penn Treebank and then calculating the standard evaluation metrics between the resulting trees and the corresponding gold trees (Table 1). This can be regarded as evaluating the upper bound of the conversion quality, since the evaluated data comes from the same domain as the converter’s training data. Our converter shows much higher scores than the current best-performing depccg combined with ELMo (1.50 and 2.17 points higher in unlabeled and labeled F1, respectively), suggesting that, using the proposed converter, we can obtain CCGbanks of high quality.
References
- Lasha Abzianidze. 2017. LangPro: Natural Language Theorem Prover. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 115–120. Association for Computational Linguistics.
- Bharat Ram Ambati, Tejaswini Deoskar, and Mark Steedman. 2013. Using CCG categories to improve Hindi dependency parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 604–609. Association for Computational Linguistics.
- Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An Approach to Almost Parsing. Computational Linguistics, 25(2):237–265.
- Johan Bos, Valerio Basile, Kilian Evang, Noortje J. Venhuizen, and Johannes Bjerva. 2017. The Groningen Meaning Bank. In Handbook of Linguistic Annotation, pages 463–496. Springer Netherlands.
- Johan Bos, Cristina Bosco, and Alessandro Mazzei. 2009. Converting a Dependency Treebank to a Categorial Grammar Treebank for Italian. In Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories, pages 27–38.
- Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.
- Stephen Clark, Julia Hockenmaier, and Mark Steedman. 2002. Building Deep Dependency Structures with a Wide-coverage CCG Parser. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 327–334. Association for Computational Linguistics.
- Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). ICLR.
