Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization
Ye Zhang, Matthew Lease, Byron C. Wallace

TL;DR
This paper introduces a novel method for leveraging external linguistic resources in neural NLP models through grouped weight sharing, leading to improved classification performance.
Contribution
It presents a new approach that uses weight sharing to incorporate domain knowledge into neural models, moving beyond traditional model compression techniques.
Findings
Improved classification accuracy with external resources
Consistent performance gains over baseline models
Flexible integration of prior knowledge into neural networks
Abstract
A fundamental advantage of neural models for NLP is their ability to learn representations from scratch. However, in practice this often means ignoring existing external linguistic resources, e.g., WordNet or domain specific ontologies such as the Unified Medical Language System (UMLS). We propose a general, novel method for exploiting such resources via weight sharing. Prior work on weight sharing in neural networks has considered it largely as a means of model compression. In contrast, we treat weight sharing as a flexible mechanism for incorporating prior knowledge into neural models. We show that this approach consistently yields improved performance on classification tasks compared to baseline strategies that do not exploit weight sharing.
| Dataset | total #instances | vocabulary size | #positive instances | #negative instances |
|---|---|---|---|---|
| MR | 10662 | 18765 | 5331 | 5331 |
| CR | 3773 | 5340 | 2406 | 1367 |
| MPQA | 10604 | 6246 | 3311 | 7293 |
| AN | 5653 | 5554 | 653 | 5000 |
| CL | 8288 | 3684 | 768 | 7520 |
| ST | 3464 | 2965 | 173 | 3291 |
| PB | 4749 | 3086 | 243 | 4506 |
| Method | MR | CR | MPQA |
|---|---|---|---|
| p only | 81.02 (80.84,81.24) | 84.34 (84.21,84.53) | 89.41 (89.22,89.58) |
| p + r | 81.25 (81.19,81.32) | 84.33 (84.24,84.38) | 89.63 (89.58,89.71) |
| p + retro | 81.35 (81.23,81.51) | 84.16 (84.09,84.28) | 89.61 (89.48,89.77) |
| p + S (no sharing) | 81.39 (81.32,81.43) | 84.13 (84.06,84.21) | 89.71 (89.67,89.75) |
| p + B (no sharing) | 81.50 (81.29,81.63) | 84.60 (84.53,84.66) | 89.57 (89.52,89.61) |
| p + S (sharing) | 81.69 (81.60,81.78) | 84.34 (84.24,84.43) | 89.84 (89.74,90.13) |
| p + B (sharing) | 81.83 (81.80,81.87) | 84.68 (84.64,84.72) | 89.97 (89.74,90.13) |
| Method | AN | CL | ST | PB |
|---|---|---|---|---|
| p only | 86.63 (86.57,86.67) | 88.73 (88.51,89.00) | 67.15 (66.00, 67.91) | 90.11 (89.46, 91.03) |
| p + r | 85.67 (85.46,85.95) | 88.87 (88.56,89.03) | 67.72 (67.65,67.86) | 90.12 (89.87,90.47) |
| p + retro | 86.46 (86.32,86.65) | 89.27 (88.89,90.01) | 67.78 (67.56,68.00) | 90.07 (89.92,90.20) |
| p + U | 86.60 (86.32,87.01) | 88.93 (88.67,89.13) | 67.78 (67.71,67.85) | 90.23 (89.84,90.47) |
| p + U(s) | 87.15 (87.00,87.29) | 89.29 (89.09,89.51) | 67.73 (67.58,67.88) | 90.99 (90.46,91.59) |
Exploiting Domain Knowledge via Grouped Weight Sharing
with Application to Text Categorization
Ye Zhang1 Matthew Lease2 Byron C. Wallace3
1Department of Computer Science, University of Texas at Austin
2School of Information, University of Texas at Austin
3College of Computer & Information Science, Northeastern University
1 Introduction
Neural models are powerful in part due to their ability to learn good representations of raw textual inputs, mitigating the need for extensive task-specific feature engineering (Collobert et al., 2011). However, a downside of learning from scratch is failing to capitalize on prior linguistic or semantic knowledge, often encoded in existing resources such as ontologies. Such prior knowledge can be particularly valuable when estimating highly flexible models. In this work, we address how to exploit known relationships between words when training neural models for NLP tasks.
We propose exploiting the feature-hashing trick, originally proposed as a means of neural network compression (Chen et al., 2015). Here we instead view the partial parameter sharing induced by feature hashing as a flexible mechanism for tying together network node weights that we believe to be similar a priori. In effect, this acts as a regularizer that constrains the model to learn weights that agree with the domain knowledge codified in external resources like ontologies.
More specifically, as external resources we use Brown clusters (Brown et al., 1992), WordNet (Miller, 1995) and the Unified Medical Language System (UMLS) (Bodenreider, 2004). From these we derive groups of words with similar meaning. We then use feature hashing to share a subset of weights between the embeddings of words that belong to the same semantic group(s). This forces the model to respect prior domain knowledge, insofar as words similar under a given ontology are compelled to have similar embeddings.
Our contribution is a novel, simple and flexible method for injecting domain knowledge into neural models via stochastic weight sharing. Results on seven diverse classification tasks (three sentiment and four biomedical) show that our method consistently improves performance over (1) baselines that fail to capitalize on domain knowledge, and (2) an approach that uses retrofitting (Faruqui et al., 2014) as a preprocessing step to encode domain knowledge prior to training.
2 Grouped Weight Sharing
We incorporate similarity relations codified in existing resources (here derived from Brown clusters, SentiWordNet and the UMLS) as prior knowledge in a Convolutional Neural Network (CNN). (The idea of sharing weights to reflect known similarity is general and could be applied with other neural architectures.) To achieve this we construct a shared embedding matrix such that words known a priori to be similar are constrained to share some fraction of embedding weights.
Concretely, suppose we have $N$ groups of words derived from an external resource. Note that one could derive such groups in several ways; e.g., using the synsets in SentiWordNet. We denote groups by $\{g_1, g_2, \dots, g_N\}$. Each group $g_k$ is associated with an embedding $\mathbf{g}_k \in \mathbb{R}^d$, which we initialize by averaging the pre-trained embeddings of each word in the group.
To exploit both grouped and independent word weights, we adopt a two-channel CNN model (Zhang et al., 2016b). The embedding matrix of the first channel is initialized with pre-trained word vectors. We denote this input by $E^p \in \mathbb{R}^{V \times d}$ ($V$ is the vocabulary size and $d$ the dimension of the word embeddings). The second channel input matrix is initialized with our proposed weight-sharing embedding $E^s \in \mathbb{R}^{V \times d}$. $E^s$ is initialized by drawing from both $E^p$ and the external resource following the process we describe below.
Given an input text sequence of length $l$, we construct sequence embedding representations $W^p \in \mathbb{R}^{l \times d}$ and $W^s \in \mathbb{R}^{l \times d}$ using the corresponding embedding matrices. We then apply independent sets of linear convolution filters on these two matrices. Each filter will generate a feature map vector $\mathbf{v} \in \mathbb{R}^{l-h+1}$ ($h$ is the filter height). We perform 1-max pooling over each $\mathbf{v}$, extracting one scalar per feature map. Finally, we concatenate scalars from all of the feature maps (from both channels) into a feature vector which is fed to a softmax function to predict the label (Figure 2).
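To make the convolution and pooling step concrete, here is a minimal pure-Python sketch of the feature extraction for two channels. It is an illustration under stated assumptions, not the authors' implementation: the function names (`conv_feature_map`, `one_max_pool`, `two_channel_features`) and the toy matrices are invented for this example, the filter is applied linearly as the text describes, and the final softmax layer is omitted.

```python
def conv_feature_map(seq_emb, filt, bias=0.0):
    """Slide one filter of height h over an l x d sequence embedding.

    Returns a feature map with l - h + 1 entries (one per window).
    Assumes the sequence is at least as long as the filter is tall.
    """
    l, h = len(seq_emb), len(filt)
    feature_map = []
    for start in range(l - h + 1):
        total = bias
        for r in range(h):
            # Dot product between one filter row and the matching word vector.
            total += sum(x * w for x, w in zip(seq_emb[start + r], filt[r]))
        feature_map.append(total)
    return feature_map


def one_max_pool(feature_map):
    """1-max pooling: keep the single largest value of a feature map."""
    return max(feature_map)


def two_channel_features(seq_p, seq_s, filters_p, filters_s):
    """Concatenate pooled scalars from both channels into one feature vector."""
    pooled_p = [one_max_pool(conv_feature_map(seq_p, f)) for f in filters_p]
    pooled_s = [one_max_pool(conv_feature_map(seq_s, f)) for f in filters_s]
    return pooled_p + pooled_s


# Toy example (made-up values): 3 words, 2-dimensional embeddings per channel.
seq_p = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_s = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
filt = [[1.0, 1.0], [1.0, 1.0]]  # one filter of height 2
features = two_channel_features(seq_p, seq_s, [filt], [filt])
```

Each channel has its own filters, so the two pooled lists are computed independently and only meet at the concatenation step.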
We initialize $E^s$ as follows. Each row $\mathbf{e}_i \in \mathbb{R}^d$ of $E^s$ is the embedding of word $i$. Words may belong to one or more groups. A mapping function $G(i)$ retrieves the groups that word $i$ belongs to, i.e., $G(i)$ returns a subset of $\{g_1, \dots, g_N\}$, which we denote by $\{g^{(i)}_1, \dots, g^{(i)}_K\}$, where $K$ is the number of groups that contain word $i$. To initialize $E^s$, for each dimension $j$ of each word embedding $\mathbf{e}_i$, we use a hash function $h_i$ to map (hash) the index $j$ to one of the $K$ group IDs: $h_i : \mathbb{N} \to \{g^{(i)}_1, \dots, g^{(i)}_K\}$. Following Weinberger et al. (2009) and Shi et al. (2009), we use a second hash function $b$ to remove bias induced by hashing. This is a signing function, i.e., it maps $(i, j)$ tuples to $\{+1, -1\}$. (Empirically, we found that using this signing function does not affect performance.) We then set $e_{i,j}$ to the product of $\mathbf{g}_{h_i(j)}[j]$, the $j$-th entry of the embedding of group $h_i(j)$, and $b(i, j)$. $h$ and $b$ are both approximately uniform hash functions. Algorithm 1 provides the full initialization procedure.
For illustration, consider Figure 1. Here $g_1$ contains three words: good, nice and amazing, while $g_2$ has two words: good and interesting. The group embeddings $\mathbf{g}_1$, $\mathbf{g}_2$ are initialized as averages over the pre-trained embeddings of the words they comprise. Here, two embedding parameters belonging to different words are both mapped to the same group-embedding entry, and thus share this value. Similarly, another pair of parameters shares the value of a second group-embedding entry. We have elided the second hash function from this figure for simplicity.
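The initialization described above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's Algorithm 1: the helper names, the use of `zlib.crc32` as the underlying hash, and the 2-dimensional "pre-trained" vectors are all assumptions made for the example, and the sketch only covers words that belong to at least one group.

```python
import zlib


def stable_hash(*parts):
    """Deterministic integer hash of the given parts (CRC32 is an arbitrary choice)."""
    key = "|".join(str(p) for p in parts).encode("utf-8")
    return zlib.crc32(key)


def init_group_embeddings(groups, pretrained):
    """Initialize each group embedding as the average of its members' vectors."""
    dim = len(next(iter(pretrained.values())))
    return {
        gid: [sum(pretrained[w][j] for w in members) / len(members) for j in range(dim)]
        for gid, members in groups.items()
    }


def init_shared_embeddings(groups, group_emb, dim):
    """Fill each word's shared embedding, one dimension at a time."""
    # Invert the group lists: word -> list of groups that contain it.
    word_groups = {}
    for gid, members in sorted(groups.items()):
        for w in members:
            word_groups.setdefault(w, []).append(gid)

    shared = {}
    for w, gids in word_groups.items():
        row = []
        for j in range(dim):
            gid = gids[stable_hash("h", w, j) % len(gids)]       # first hash: pick a group
            sign = 1 if stable_hash("b", w, j) % 2 == 0 else -1  # second hash: pick a sign
            row.append(group_emb[gid][j] * sign)
        shared[w] = row
    return shared


# Groups follow the illustration in the text; the vectors are made up.
groups = {"g1": ["good", "nice", "amazing"], "g2": ["good", "interesting"]}
pretrained = {
    "good": [1.0, 2.0],
    "nice": [2.0, 4.0],
    "amazing": [3.0, 6.0],
    "interesting": [5.0, 0.0],
}
group_emb = init_group_embeddings(groups, pretrained)
shared = init_shared_embeddings(groups, group_emb, dim=2)
```

In this toy setup, "nice" belongs only to the first group, so every entry of its shared embedding equals the corresponding group entry up to sign, while each entry of "good" is drawn from one of its two groups.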
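During training, we update the first-channel embedding matrix $E^p$ as usual using back-propagation (Rumelhart et al., 1986). We update the shared embedding matrix $E^s$ and the group embeddings $\mathbf{g}$ in a manner similar to Chen et al. (2015). In the forward propagation before each training step (mini-batch), we derive the value of $E^s$ from $\mathbf{g}$:
$$E^s[i, j] = \mathbf{g}_{h_i(j)}[j] \cdot b(i, j) \tag{1}$$
We use this newly updated $E^s$ to perform forward propagation in the CNN.
During backward propagation, we first compute the gradient of $E^s$, and then we use this to derive the gradient w.r.t. $\mathbf{g}$. To do this, for each dimension $j$ in $\mathbf{g}_k$, we aggregate the gradients w.r.t. $E^s$ whose elements are mapped to this dimension:
$$\nabla \mathbf{g}_k[j] = \sum_{i} \nabla E^s[i, j] \cdot b(i, j) \cdot \delta\big(h_i(j) = g_k\big) \tag{2}$$
where $h_i(j)$ is the group that the first hash function assigns to dimension $j$ of word $i$, $b(i, j) \in \{+1, -1\}$ is the signing hash, and $\delta(h_i(j) = g_k) = 1$ when $h_i(j) = g_k$, and 0 otherwise. Each training step involves executing Equations 1 and 2. Once the shared gradient is calculated, gradient descent proceeds as usual. We update all parameters aside from the shared weights in the standard way.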
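The two training-time steps, rebuilding the shared matrix from the group embeddings and folding its gradients back onto the group weights, can be sketched as follows. This is an illustrative sketch under assumptions: the group assignments and signs are given as precomputed tables rather than hash functions, the gradient values are stand-ins for what the CNN would produce, and a plain gradient-descent step is used for brevity where the paper uses Adadelta.

```python
def forward_shared(group_emb, assign, sign):
    """Forward step: rebuild the shared matrix from the group embeddings."""
    return [
        [group_emb[gid][j] * s for j, (gid, s) in enumerate(zip(a_row, s_row))]
        for a_row, s_row in zip(assign, sign)
    ]


def backward_shared(grad_shared, assign, sign, group_emb):
    """Backward step: sum the signed gradients of all entries tied to each group weight."""
    grad_group = {gid: [0.0] * len(vec) for gid, vec in group_emb.items()}
    for i, row in enumerate(grad_shared):
        for j, grad in enumerate(row):
            grad_group[assign[i][j]][j] += grad * sign[i][j]
    return grad_group


def sgd_step(group_emb, grad_group, lr):
    """Plain gradient descent on the group embeddings (a simplification)."""
    return {
        gid: [w - lr * g for w, g in zip(vec, grad_group[gid])]
        for gid, vec in group_emb.items()
    }


# Toy setup: 2 words, 2 dimensions, 2 groups; all values are made up.
group_emb = {"g1": [0.5, 1.0], "g2": [2.0, -3.0]}
assign = [["g1", "g2"], ["g1", "g1"]]  # precomputed group per (word, dimension)
sign = [[1, -1], [1, 1]]               # precomputed sign per (word, dimension)

shared = forward_shared(group_emb, assign, sign)
grad_shared = [[0.25, 0.5], [0.75, 1.0]]  # stand-in for gradients from the CNN
grad_group = backward_shared(grad_shared, assign, sign, group_emb)
updated = sgd_step(group_emb, grad_group, lr=1.0)
```

Because both words map their first dimension to the same group weight, that weight receives the sum of their two gradients, which is what ties the embeddings together during training.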
The number of parameters in our approach scales linearly with the number of channels. But the gradients can actually be back-propagated in a distributed way for each channel, since the convolutional and embedding layers are independent across these. Thus training time scales approximately linearly with the number of parameters in one channel (if the gradient is back-propagated in a distributed way).
3 Experimental Setup
3.1 Datasets
We use three sentiment datasets: a movie review (MR) dataset (Pang and Lee, 2005; www.cs.cornell.edu/people/pabo/movie-review-data/); a customer review (CR) dataset (Hu and Liu, 2004; www.cs.uic.edu/~liub/FBS/sentiment-analysis.html); and an opinion dataset (MPQA) (Wiebe et al., 2005; mpqa.cs.pitt.edu/corpora/mpqa_corpus/).
We also use four biomedical datasets, which concern systematic reviews. The task here is to classify published articles describing clinical trials as relevant or not to a well-specified clinical question. Articles deemed relevant are included in the corresponding review, which is a synthesis of all pertinent evidence (Wallace et al., 2010). We use data from reviews that concerned: clopidogrel (CL) for cardiovascular conditions (Dahabreh et al., 2013); biomarkers for assessing iron deficiency in anemia (AN) experienced by patients with kidney disease (Chung et al., 2012); statins (ST) (Cohen et al., 2006); and proton beam (PB) therapy (Terasawa et al., 2009).
3.2 Implementation Details and Baselines
We use SentiWordNet (Baccianella et al., 2010; sentiwordnet.isti.cnr.it) for the sentiment tasks. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity and objectivity, constrained to sum to 1. We keep only the synsets with positivity or negativity scores greater than 0, i.e., we remove synsets deemed objective. The synsets in SentiWordNet constitute our groups. We also use the Brown clustering algorithm (github.com/percyliang/brown-cluster) on the three sentiment datasets. We generate 1000 clusters and treat each as a group.
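The synset filter can be illustrated with a short Python sketch. The data layout, the synset identifiers and the scores below are hypothetical and do not reflect the actual SentiWordNet file format; the sketch reads "positivity or negativity scores greater than 0" as keeping a synset when either score is positive.

```python
def sentiment_groups(synsets):
    """Keep synsets whose positivity or negativity is above 0; each becomes a group."""
    groups = {}
    for synset_id, (words, pos_score, neg_score) in synsets.items():
        if pos_score > 0 or neg_score > 0:
            groups[synset_id] = list(words)
    return groups


# Hypothetical entries: (member words, positivity, negativity).
# Objectivity is implied as 1 - positivity - negativity.
synsets = {
    "syn_a": (["good", "nice"], 0.75, 0.0),
    "syn_b": (["table"], 0.0, 0.0),            # fully objective, so it is dropped
    "syn_c": (["awful", "terrible"], 0.0, 0.625),
}
groups = sentiment_groups(synsets)
```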
For the biomedical datasets, we use the Medical Subject Headings (MeSH) terms (www.nlm.nih.gov/bsd/disted/meshtutorial/) attached to each abstract to classify them. Each MeSH term has a tree number indicating the path from the root in the UMLS. For example, ‘Alagille Syndrome’ has tree number ‘C06.552.150.125’; periods denote tree splits, numbers are nodes. We induce groups comprising MeSH terms that share the same first three parent nodes, e.g., all terms with ‘C06.552.150’ as their tree number prefix constitute one group.
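A minimal sketch of this prefix-based grouping in Python is given below. Only the ‘Alagille Syndrome’ tree number comes from the text; the other terms are placeholders invented for illustration. How terms with fewer than three nodes are handled is not stated in the text, so keying them by their full tree number is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_tree_prefix(tree_numbers, depth=3):
    """Group MeSH terms whose tree numbers share the same first `depth` nodes."""
    groups = defaultdict(set)
    for term, number in tree_numbers.items():
        nodes = number.split(".")         # periods denote tree splits
        prefix = ".".join(nodes[:depth])  # shallower numbers keep their full path (assumption)
        groups[prefix].add(term)
    return dict(groups)


# Only the first entry is taken from the text; the rest are placeholders.
tree_numbers = {
    "Alagille Syndrome": "C06.552.150.125",
    "placeholder term A": "C06.552.150.250",
    "placeholder term B": "C06.552.308.500",
    "placeholder term C": "C06.552",
}
groups = group_by_tree_prefix(tree_numbers)
```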
We compare our approach to several baselines. All use pre-trained embeddings to initialize the first-channel matrix $E^p$, but we explore several approaches to exploiting the second-channel matrix $E^s$: (1) randomly initialize $E^s$; (2) initialize $E^s$ to reflect the group embeddings $\mathbf{g}$, but do not share weights during the training process, i.e., do not constrain their weights to be equal when we perform back-propagation; (3) use linguistic resources to retro-fit (Faruqui et al., 2014) the pre-trained embeddings, and use these to initialize $E^s$. For retro-fitting, we first construct a graph derived from SentiWordNet. Then we run belief-propagation on the graph to encourage linked words to have similar vectors. This is a pre-processing step only; we do not impose weight sharing constraints during training.
For the sentiment datasets we use three filter heights (3,4,5) for each of the two CNN channels. For the biomedical datasets, we use only one filter height (1), because the inputs are unstructured MeSH terms. (For this work we are ignoring title and abstract texts.) In both cases we use 100 filters of each unique height. For the sentiment datasets, we use Google word2vec (Mikolov et al., 2013; code.google.com/archive/p/word2vec/) to initialize $E^p$. For the biomedical datasets, we use word2vec trained on biomedical texts (Moen and Ananiadou, 2013; bio.nlplab.org/) to initialize $E^p$. For parameter estimation, we use Adadelta (Zeiler, 2012). Because the biomedical datasets are imbalanced, we use downsampling (Zhang et al., 2016a; Zhang and Wallace, 2015) to effectively train on balanced subsets of the data.
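As a rough illustration of downsampling, the sketch below draws a balanced subset by keeping every minority-class instance and sampling an equal number from the majority class. This is one common form of the technique and an assumption on our part; the exact procedure in the cited work may differ. The function name and the seed are hypothetical, and the class counts are those listed for the AN dataset.

```python
import random


def downsample_balanced(labels, rng):
    """Return shuffled indices of a class-balanced subset of a binary-labeled dataset."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep all minority instances; sample the same number from the majority class.
    subset = minority + rng.sample(majority, len(minority))
    rng.shuffle(subset)
    return subset


# AN dataset class counts from the dataset table: 653 positive, 5000 negative.
labels = [1] * 653 + [0] * 5000
rng = random.Random(0)  # fixed seed so the sketch is repeatable
subset = downsample_balanced(labels, rng)
n_pos = sum(labels[i] for i in subset)
```

Drawing a fresh subset at each epoch or for each ensemble member would let training see more of the majority class over time, though whether the authors did so is not stated here.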
We developed our approach using the MR sentiment dataset, tuning our approach to constructing groups from the available resources – experiments on other sentiment datasets were run after we finalized the model and hyperparameters. Similarly, we used the anemia (AN) review as a development set for the biomedical tasks, especially w.r.t. constructing groups from MeSH terms using UMLS.
4 Results
We replicate each experiment five times (each is a 10-fold cross validation), and report the mean (min, max) across these replications. Results on the sentiment and biomedical corpora are presented in Tables 2 and 3, respectively. (Sentiment task results are not directly comparable to prior work due to different preprocessing steps.) These experiments exploit different external resources to induce the word groupings that in turn inform weight sharing. We report AUC for the biomedical datasets because these are highly imbalanced (see Table 1).
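For readers who want to reproduce the reporting format, the sketch below shows one way to summarize replication scores as mean (min, max) and to compute AUC from classifier scores via pairwise comparison of positive and negative instances. The function names are hypothetical and the numbers are made up; they are not results from the paper.

```python
def summarize(replication_scores):
    """Format scores from repeated runs as 'mean (min,max)'."""
    mean = sum(replication_scores) / len(replication_scores)
    return "{:.2f} ({:.2f},{:.2f})".format(
        mean, min(replication_scores), max(replication_scores)
    )


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of instances,
    which is fine for a sketch but slow for large datasets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores from five hypothetical replications.
report = summarize([81.0, 81.5, 82.0, 81.25, 81.25])

# Made-up classifier outputs for four instances.
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Unlike accuracy, AUC does not depend on a decision threshold or on the class ratio, which is why it suits the imbalanced biomedical datasets.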
Our method improves performance compared to all relevant baselines (including an approach that also exploits external knowledge via retrofitting) in six of seven cases. Informing weight initialization using external resources improves performance independently, but additional gains are realized by also enforcing sharing during training.
We note that our aim here is not necessarily to achieve state-of-the-art results on any given dataset, but rather to evaluate the proposed method for incorporating external linguistic resources into neural models via weight sharing. We have therefore compared to baselines that enable us to assess this.
5 Related Work
Neural Models for NLP. Recently there has been enormous interest in neural models for NLP generally (Collobert et al., 2011; Goldberg, 2016). Most relevant to this work, simple CNN based models (which we have built on here) have proven extremely effective for text categorization (Kim, 2014; Zhang and Wallace, 2015).
Exploiting Linguistic Resources. A potential drawback to learning from scratch in end-to-end neural models is a failure to capitalize on existing knowledge sources. There have been efforts to exploit such resources specifically to induce better word vectors (Yu and Dredze, 2014; Faruqui et al., 2014; Yu et al., 2016; Xu et al., 2014). But these models do not attempt to exploit external resources jointly during training for a particular downstream task (which uses word embeddings as inputs), as we do here.
Past work on sparse linear models has shown the potential of exploiting linguistic knowledge in statistical NLP models. For example, Yogatama and Smith (2014) used external resources to inform structured, grouped regularization of log-linear text classification models, yielding improvements over standard regularization approaches. Elsewhere, Doshi-Velez et al. (2015) proposed a variant of LDA that exploits a priori known tree-structured relations between tokens (e.g., derived from the UMLS) in topic modeling.
Weight-sharing in NNs. Recent work has considered stochastically sharing weights in neural models. Notably, Chen et al. (2015) proposed randomly sharing weights in neural networks. Elsewhere, Han et al. (2015) proposed quantized weight sharing as an intermediate step in their deep compression model. In these works, the primary motivation was model compression, whereas here we view the hashing trick as a mechanism to encode domain knowledge.
6 Conclusion
We have proposed a novel method for incorporating prior semantic knowledge into neural models via stochastic weight sharing. We have shown that it generally improves text classification performance vs. model variants which do not exploit external resources and vs. an approach based on retrofitting prior to training. In future work, we will investigate generalizing our approach beyond classification, and informing weight sharing using other varieties and sources of linguistic knowledge.
Acknowledgements. This work was made possible by NPRP grant NPRP 7-1313-1-245 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.
References
- Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.
- Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1):D267–D270.
- Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4):467–479.
- Wenlin Chen, James T Wilson, Stephen Tyree, Kilian Q Weinberger, and Yixin Chen. 2015. Compressing neural networks with the hashing trick. In ICML, pages 2285–2294.
- Mei Chung, Denish Moorthy, Nira Hadar, Priyanka Salvi, Ramon C Iovin, and Joseph Lau. 2012. Biomarkers for Assessing and Managing Iron Deficiency Anemia in Late-Stage Chronic Kidney Disease. AHRQ Comparative Effectiveness Reviews. Agency for Healthcare Research and Quality (US), Rockville (MD).
- Aaron M Cohen, William R Hersh, K Peterson, and Po-Yin Yen. 2006. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association 13(2):206–219.
- Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
- Issa J Dahabreh, Denish Moorthy, Jenny L Lamont, Minghua L Chen, David M Kent, and Joseph Lau. 2013. Testing of CYP2C19 variants and platelet reactivity for guiding antiplatelet treatment.
