Tracking Discrete and Continuous Entity State for Process Understanding
Aditya Gupta, Greg Durrett

TL;DR
This paper introduces a neural architecture that models both discrete and continuous aspects of entity states in procedural text, improving process understanding and achieving state-of-the-art results on QA tasks.
Contribution
The paper presents a structured neural model combining recurrent tracking of entity states with a neural CRF to enforce global constraints, a novel approach for process understanding.
Findings
Achieves state-of-the-art results on the ProPara dataset.
Effectively models entity state constraints over time.
Improves accuracy in process-related QA tasks.
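
Table 1: The proposed 6 tag scheme for entity state tracking.

| Tags | Description |
|---|---|
| None (before), None (after) | None state before and after existence, resp. |
| Create, Destroy | Creation and destruction event for entity, resp. |
| Exists | Exists in the process without any state change |
| Move | Entity moves from one location to another |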

Table 2: Results on Task 1 (sentence level) and Task 2 (document level) of ProPara.

| Model | Task-1 Cat-1 | Task-1 Cat-2 | Task-1 Cat-3 | Task-1 Macro-Avg | Task-1 Micro-Avg | Task-2 Precision | Task-2 Recall | Task-2 F1 |
|---|---|---|---|---|---|---|---|---|
| EntNet Henaff et al. (2017) | 51.62 | 18.83 | 7.77 | 26.07 | 25.96 | 50.2 | 33.5 | 40.2 |
| QRN Seo et al. (2017) | 52.37 | 15.51 | 10.92 | 26.26 | 26.49 | 55.5 | 31.3 | 40.0 |
| ProGlobal Dalvi et al. (2018) | 62.95 | 36.39 | 35.90 | 45.08 | 45.37 | 46.7 | 52.4 | 49.4 |
| ProStruct Tandon et al. (2018) | - | - | - | - | - | 74.2 | 42.1 | 53.75 |
| KG-MRC Das et al. (2019) | 62.86 | 40.00 | 38.23 | 47.03 | 46.62 | 64.52 | 50.68 | 56.77 |
| This work: NCET | 70.55 | 44.57 | 41.34 | 52.15 | 52.31 | 64.2 | 53.9 | 58.6 |
| This work: NCET + ELMo | 73.68 | 47.09 | 41.03 | 53.93 | 53.97 | 67.1 | 58.5 | 62.5 |

Table 3: Model ablations (Section 3.3).

| Model | Cat-1 | Cat-2 | Cat-3 | Macro-Avg | Micro-Avg |
|---|---|---|---|---|---|
| NCET | 72.27 | 46.08 | 40.82 | 53.06 | 53.13 |
| Tag Set 1 | 71.53 | 41.89 | 41.42 | 51.61 | 51.94 |
| Tag Set 2 | 71.97 | 41.85 | 39.71 | 51.18 | 51.43 |
| No transition features | 71.68 | 44.22 | 40.38 | 52.09 | 52.24 |
| No verb information | 73.16 | 42.58 | 41.85 | 52.53 | 52.85 |
| Attention-based input | 61.69 | 22.80 | 36.44 | 40.31 | 41.38 |

Department of Computer Science
The University of Texas at Austin
{agupta,gdurrett}@cs.utexas.edu
Abstract
Procedural text, which describes entities and their interactions as they undergo some process, depicts entities in a uniquely nuanced way. First, each entity may have some observable discrete attributes, such as its state or location; modeling these involves imposing global structure and enforcing consistency. Second, an entity may have properties which are not made explicit but can be effectively induced and tracked by neural networks. In this paper, we propose a structured neural architecture that reflects this dual nature of entity evolution. The model tracks each entity recurrently, updating its hidden continuous representation at each step to contain relevant state information. The global discrete state structure is explicitly modelled with a neural CRF over the changing hidden representation of the entity. This CRF can explicitly capture constraints on entity states over time, enforcing that, for example, an entity cannot move to a location after it is destroyed. We evaluate the performance of our proposed model on QA tasks over process paragraphs in the ProPara dataset Dalvi et al. (2018) and find that our model achieves state-of-the-art results.
1 Introduction
Many reading comprehension question answering tasks Richardson et al. (2013); Rajpurkar et al. (2016); Joshi et al. (2017) require looking at primarily one point in the passage to answer each question, or sometimes two or three Yang et al. (2018); Welbl et al. (2018). As a result, modeling surface-level correspondences can work well Seo et al. (2017) and holistic passage comprehension is not necessary. However, certain QA settings require deeper analysis by focusing specifically on entities, asking questions about their states over time Weston et al. (2015); Long et al. (2016), combination in recipes Bosselut et al. (2018), and participation in scientific processes Dalvi et al. (2018). These settings suggest more highly structured models as a way of dealing with the more highly structured tasks. One crucial aspect of such texts is the way an entity’s state evolves through both discrete phenomena (observable state and location changes) and continuous phenomena (changes in unobserved hidden attributes). Additionally, the discrete changes unfold in a way that maintains state consistency: an entity cannot be destroyed before it even starts to exist.
In this work, we present a model that recurrently tracks each entity in a continuous space while imposing discrete constraints using a conditional random field (CRF). We focus on the scientific process understanding setting introduced in Dalvi et al. (2018). For each entity, we instantiate a sentence-level LSTM to distill continuous state information from each of that entity’s mentions. Separate LSTMs integrate entity-location information into this process. These continuous components then produce potentials for a sequential CRF tagging layer, which predicts discrete entity states. The CRF’s problem-specific tag scheme, along with transition constraints, ensures that the model’s predictions of these observed entity properties are structurally coherent. For example, in procedural texts, this involves ensuring existence before destruction and unique creation and destruction points. Because we use global inference, identifying implicit creation or destruction events is made easier, since the model resolves conflicts among competing time steps and chooses the best time step for these events during sequence prediction.
Past approaches in the literature have typically been end-to-end continuous task-specific frameworks Henaff et al. (2017); Bosselut et al. (2018), sometimes for tasks that are simpler and more synthetic Weston et al. (2015), or continuous entity-centric neural language models Clark et al. (2018); Ji et al. (2017). For process understanding specifically, past work has effectively captured global information Dalvi et al. (2018) and temporal characteristics Das et al. (2019). However, these models do not leverage the structural constraints of the problem, or only handle them heuristically Tandon et al. (2018). We find that our model outperforms these past approaches on the ProPara dataset of Dalvi et al. (2018), with a significant boost on questions concerning entity state, regardless of the location.
2 Model
We propose a structured neural model for the process paragraph comprehension task of Dalvi et al. (2018). An example from their dataset is shown in Figure 1. It consists of annotation over a process paragraph $w = (w_1, \dots, w_P)$ of $P$ tokens described by a sequence of $T$ sentences $s = (s_1, \dots, s_T)$. A pre-specified set of entities $E = \{e_1, \dots, e_n\}$ is given as well. For each entity, gold annotation is provided consisting of the state (Exists, Moves, etc.) and location (soil, leaf) after each sentence. From this information, a set of questions about the process can be answered deterministically as outlined in Tandon et al. (2018).
Our model, as depicted in Fig. 1, consists of two core modules: (i) state tracking, and (ii) location tracking. We follow past work on neural CRFs Collobert et al. (2011); Durrett and Klein (2015); Lample et al. (2016), leveraging continuous LSTMs to distill information and a discrete CRF layer for prediction.
2.1 State Tracking
This part of the model is charged with modeling each entity’s state over time. Our model places a distribution over state sequences $y = (y_1, \dots, y_T)$, with one tag per sentence, given a passage $w$ and an entity $e$: $P(y \mid w, e)$.
Contextual Embeddings
Our model first computes contextual embeddings for each word in the paragraph using a single-layer bidirectional LSTM. Each token $w_i$ is encoded as a vector $x_i = [\mathrm{emb}(w_i); v_i]$ which serves as input to the LSTM. Here, $\mathrm{emb}(w_i)$ is an embedding for the word produced by either pre-trained GloVe Pennington et al. (2014) or ELMo Peters et al. (2018) embeddings and $v_i$ is a scalar binary indicator of whether the current word is a verb. We denote by $h_i$ the LSTM’s output for the $i$th token in the paragraph.
Entity Tracking LSTM
To track entities across sentences for state changes, we use another task specific bidirectional LSTM on top of the base LSTM which operates at the sentence level. The aim of this BiLSTM is to get a continuous representation of the entity’s state at each time step, since not all time steps mention that entity. This representation can capture long-range information about the entity’s state which may not be summarized in the discrete representation.
For a fixed entity $e$ and each sentence $s_t$ in the paragraph, the input to the entity tracking LSTM is the contextual embedding of the mention location of the entity in $s_t$, or a mask vector when the entity isn’t present in $s_t$. (Footnote 1: We use mention location to differentiate these from the physical entity locations present in this QA domain.) Let $x_t^e$ denote the representation of entity $e$ in sentence $s_t$. Then

$$x_t^e = \begin{cases} [h_t^e; h_t^v] & \text{if } e \text{ is mentioned in } s_t \\ \text{mask vector} & \text{otherwise} \end{cases}$$

where $h_t^e$ and $h_t^v$ denote the contextual embeddings of the entity and the associated verb, respectively, from the base BiLSTM. In case of multiple tokens, a mean pooling over the token representations is used. Here, the information about the verb is extracted using POS tags from an off-the-shelf POS tagger. The entity tracking LSTM then produces representations $\tilde{h}_t^e$ for each sentence $s_t$.
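The input construction described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the list-of-floats vector type, and the zero fallback when no verb is tagged are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tracking_input(
    token_states: Sequence[Vector],
    entity_idx: Sequence[int],
    verb_idx: Sequence[int],
    mask: Vector,
) -> Vector:
    """Build the per-sentence input for the entity tracking LSTM.

    token_states: contextual embeddings for the tokens of one sentence.
    entity_idx / verb_idx: token positions of the entity mention and its verb.
    mask: vector returned when the entity is not mentioned in the sentence.
    """
    if not entity_idx:
        return list(mask)
    entity_vec = mean_pool([token_states[i] for i in entity_idx])
    # Assumption: if no verb was tagged, use zeros for the verb half.
    if verb_idx:
        verb_vec = mean_pool([token_states[i] for i in verb_idx])
    else:
        verb_vec = [0.0] * len(entity_vec)
    # Concatenate the entity and verb representations.
    return entity_vec + verb_vec


states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(tracking_input(states, [0, 1], [2], [0.0] * 4))
print(tracking_input(states, [], [2], [0.0] * 4))
```

The first call mean-pools a two-token mention and concatenates it with the verb vector; the second returns the mask vector because the entity is absent from the sentence.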
Neural CRF
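We use the output of the entity tracking BiLSTM to generate emission potentials for each tag $y_t$ in our possible tag set at each time step $t$:

$$\phi(y_t, t, w, e) = \left[ W \tilde{h}_t^e \right]_{y_t}$$

where $W$ is a learnable parameter matrix and $\tilde{h}_t^e$ is the entity tracking BiLSTM output for entity $e$ at step $t$. For the specific case of entity tracking, we propose a 6 tag scheme: two None tags for the state before and after existence, creation and destruction tags, a tag for existing in the process without any state change, and a move tag.
Additionally, we train a transition matrix to get transition potentials between tags, which we denote by $\psi(y_{t-1}, y_t)$, with two extra tags marking the start and the end of the sequence. Finally, for a tag sequence $y = (y_1, \dots, y_T)$, we get the probability as:

$$P(y \mid w, e) \propto \exp\left( \sum_{t=1}^{T} \phi(y_t, t, w, e) + \sum_{t=1}^{T+1} \psi(y_{t-1}, y_t) \right)$$

where $y_0$ and $y_{T+1}$ are the extra start and end tags.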
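To make the sequence probability concrete, the following sketch scores a tag sequence under a generic linear-chain CRF and normalizes it with the forward algorithm. It is a plain-Python illustration under assumptions: tags are integer indices, and `start` and `stop` hold the transition scores from and to the two extra boundary tags. None of these names come from the paper.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, start, stop, tags: List[int]) -> float:
    """Unnormalized log-score: emission plus transition potentials along the path."""
    score = start[tags[0]] + emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score + stop[tags[-1]]


def log_partition(emissions, transitions, start, stop) -> float:
    """Forward algorithm: log of the summed scores of all tag sequences."""
    n_tags = len(start)
    alpha = [start[k] + emissions[0][k] for k in range(n_tags)]
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[j] + transitions[j][k] for j in range(n_tags)])
            + emissions[t][k]
            for k in range(n_tags)
        ]
    return logsumexp([alpha[k] + stop[k] for k in range(n_tags)])


def log_prob(emissions, transitions, start, stop, tags: List[int]) -> float:
    """Log-probability of one tag sequence under the CRF."""
    return sequence_score(emissions, transitions, start, stop, tags) - log_partition(
        emissions, transitions, start, stop
    )


# Toy example with 2 tags and 3 time steps (values are arbitrary).
em = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9]]
tr = [[0.2, -0.1], [0.0, 0.4]]
st, sp = [0.1, -0.3], [0.0, 0.2]
print(math.exp(log_prob(em, tr, st, sp, [0, 1, 1])))
```

Hard structural constraints fit the same code: setting a forbidden transition score to `float("-inf")` removes every sequence that uses it from both the numerator and the normalizer.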
2.2 Location Tracking
To complement the entity’s state changes with the change in physical location of the entity, we use a separate recurrent module to predict the locations. Given a set of potential locations $L = \{l_1, \dots, l_m\}$, where each $l_j$ is a contiguous span in the paragraph $w$, the location predictor outputs a distribution over $L$ for a passage $w$ and entity $e$ at a given time step $t$: $P(l \mid w, e, t)$.
Identifying potential locations
Instead of considering all the spans of text as candidates for potential locations, we systematically reduce the set of locations by utilizing the part of speech (POS) tags of the tokens, extracting all the maximal noun and noun + adjective spans as potential physical location spans. Thus, using an off-the-shelf POS tagger, we get a set of potential locations for each paragraph. These heuristics lead to a recall classifier for locations which are not null or unk. (Footnote 2: Major non-matching cases include long phrases like “deep in the earth”, “side of the fault line”, and “area of high elevation” where the heuristics pick “earth”, “fault line”, and “area”, respectively.)
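One way such a POS-based candidate filter could look is sketched below. It is an illustration under assumptions, not the authors' code: it uses universal-style tags (`NOUN`, `ADJ`), takes maximal runs of adjectives and nouns, and trims trailing adjectives so that every span ends in a noun. The exact rules used in the paper may differ.

```python
from typing import List, Sequence


def candidate_locations(tokens: Sequence[str], pos_tags: Sequence[str]) -> List[str]:
    """Return maximal adjective/noun runs that end in a noun, as text spans."""
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        if pos_tags[i] not in ("ADJ", "NOUN"):
            i += 1
            continue
        # Extend the run as far as adjectives and nouns continue.
        j = i
        while j < n and pos_tags[j] in ("ADJ", "NOUN"):
            j += 1
        # Trim trailing adjectives so the span ends with a noun.
        end = j
        while end > i and pos_tags[end - 1] != "NOUN":
            end -= 1
        if end > i:
            spans.append(" ".join(tokens[i:end]))
        i = j
    return spans


tokens = "Water moves deep in the earth near the fault line".split()
tags = ["NOUN", "VERB", "ADV", "ADP", "DET", "NOUN", "ADP", "DET", "NOUN", "NOUN"]
print(candidate_locations(tokens, tags))
```

On this toy sentence the filter returns "earth" rather than "deep in the earth", the same kind of mismatch that the footnote describes for long location phrases.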
Location Tracking LSTM
For a given location $l$ and an entity $e$, we take the mean of the hidden representations of the tokens in the span of $l$ in sentence $s_t$ (or else a mask vector), analogous to the input for the entity state tracking LSTM, and concatenate it with the representation of the mention location of the entity in $s_t$. The result is the input at time step $t$ for tracking this entity-location pair. Fig. 1 shows an example where we instantiate location tracking LSTMs for each pair of entity $e$ and potential location $l$.
Softmax over Location Potentials
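The output of the location tracking LSTM is then used to generate a potential $\phi_t^{e,l}$ for each entity $e$ and location $l$ pair at time step $t$. Taking a softmax over the potentials gives us a probability distribution over the candidate locations at that time step for that entity $e$:

$$P(l \mid w, e, t) = \frac{\exp(\phi_t^{e,l})}{\sum_{l'} \exp(\phi_t^{e,l'})}$$

where the sum runs over all candidate locations $l'$.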
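The softmax step can be illustrated with a small, numerically stable helper. The dictionary layout (location string mapped to potential) and the function name are assumptions made for this sketch.

```python
import math
from typing import Dict


def location_distribution(potentials: Dict[str, float]) -> Dict[str, float]:
    """Softmax over per-location potentials for one entity at one time step."""
    # Subtracting the maximum keeps exp() from overflowing.
    m = max(potentials.values())
    exps = {loc: math.exp(p - m) for loc, p in potentials.items()}
    z = sum(exps.values())
    return {loc: v / z for loc, v in exps.items()}


dist = location_distribution({"soil": 2.0, "leaf": 0.0, "root": 0.0})
print(max(dist, key=dist.get), round(sum(dist.values()), 6))
```

The location with the largest potential receives the largest probability, and the probabilities over all candidates sum to one.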
2.3 Learning and Inference
The full model is trained end-to-end by minimizing the negative log likelihood of the gold state tag sequence for each entity and process paragraph pair. The location predictor is only trained to make predictions when the gold location is defined for that entity in the dataset (i.e., the entity exists).
At inference time, we perform a global state change inference coupled with location prediction in a pipelined fashion. First, we use the state tracking module of the proposed model to predict the state change sequence with the maximum score using Viterbi decoding. Subsequently, we predict locations where the predicted tag is either create or move, which is sufficient to identify the object’s location at all times since these are the only points where it can change.
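The pipelined inference above can be sketched as constrained Viterbi decoding followed by location assignment at create and move steps. Everything specific here is an assumption for illustration: the tag names, the table of allowed transitions (chosen to enforce existence before destruction and a single creation and destruction point), and the omission of learned transition scores. Entities that already exist at the first step keep an unknown (`None`) location in this sketch until they move.

```python
from typing import Dict, List, Optional

NEG_INF = float("-inf")
# Hypothetical descriptive tag names for the 6 tag scheme.
TAGS = ["NONE_BEFORE", "CREATE", "EXISTS", "MOVE", "DESTROY", "NONE_AFTER"]
ALIVE_NEXT = {"EXISTS", "MOVE", "DESTROY"}
# Assumed hard constraints: which tags may follow which.
ALLOWED = {
    "START": {"NONE_BEFORE", "CREATE", "EXISTS", "MOVE", "DESTROY"},
    "NONE_BEFORE": {"NONE_BEFORE", "CREATE"},
    "CREATE": ALIVE_NEXT,
    "EXISTS": ALIVE_NEXT,
    "MOVE": ALIVE_NEXT,
    "DESTROY": {"NONE_AFTER"},
    "NONE_AFTER": {"NONE_AFTER"},
}


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Highest-scoring tag sequence that respects ALLOWED."""
    scores = {
        tag: emissions[0][tag] if tag in ALLOWED["START"] else NEG_INF
        for tag in TAGS
    }
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = {}, {}
        for tag in TAGS:
            # Best predecessor among tags that are allowed to precede `tag`.
            prev = max((p for p in TAGS if tag in ALLOWED[p]), key=lambda p: scores[p])
            new_scores[tag] = scores[prev] + step[tag]
            pointers[tag] = prev
        scores = new_scores
        backpointers.append(pointers)
    path = [max(TAGS, key=lambda tag: scores[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def assign_locations(
    tags: List[str], location_scores: List[Dict[str, float]]
) -> List[Optional[str]]:
    """Predict a location only at CREATE/MOVE steps and carry it forward."""
    current, out = None, []
    for tag, scores in zip(tags, location_scores):
        if tag in ("CREATE", "MOVE"):
            current = max(scores, key=scores.get)
        elif tag in ("NONE_BEFORE", "DESTROY", "NONE_AFTER"):
            current = None
        out.append(current)
    return out


zero = {tag: 0.0 for tag in TAGS}
emissions = [
    {**zero, "NONE_BEFORE": 2.0},
    {**zero, "CREATE": 1.5, "MOVE": 3.0},
    {**zero, "EXISTS": 2.0},
]
tags = viterbi(emissions)
print(tags)
print(assign_locations(tags, [{"soil": 0.1, "leaf": 0.2},
                              {"soil": 0.3, "leaf": 0.9},
                              {"soil": 0.8, "leaf": 0.1}]))
```

In the toy emissions, a per-step argmax would pick a move directly after a non-existence tag. The constrained decoder instead selects a creation at the second step, which shows how global inference resolves conflicts between competing time steps.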
3 Experiments
We evaluate the performance of the proposed model on the two comprehension tasks of the ProPara dataset Dalvi et al. (2018). This dataset consists of 488 crowdsourced real-world process paragraphs about 183 distinct topics in the science genre. The names of the participating entities and their existence spans are identified by expert annotators. Finally, crowd workers label locations of participant entities at each time step (sentence). The final data consists of 3.3k sentences, with an average of 6.7 sentences and 4.17 entities per process paragraph. We compare our model, the Neural CRF Entity Tracking (NCET) model, with benchmark systems from past work.
3.1 Task 1: Sentence Level
This comprehension task concerns answering 10 fine-grained sentence-level templated questions about an entity $e$, grouped into three categories: (Cat-1) Is $e$ Created (Moved, Destroyed) in the process (yes/no for each)? (Cat-2) When was $e$ Created (Moved, Destroyed)? (Cat-3) Where was $e$ Created (Moved from/to, Destroyed)? The ground truth for these questions was extracted by the application of simple rules to the annotated location state data. Note that Cat-1 and Cat-2 can be answered from our state-tracking model alone, and only Cat-3 involves location.
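Since Cat-1 and Cat-2 depend only on the predicted state sequence, the answers can be read off with a few rules. The sketch below is illustrative: the tag names are hypothetical stand-ins for the tag scheme, and steps are numbered from 1 to match sentence order.

```python
from typing import Dict, List


def sentence_level_answers(tags: List[str]) -> Dict[str, dict]:
    """Answer 'is it created/moved/destroyed?' and 'when?' from a tag sequence."""
    answers = {}
    for event, tag in (("created", "CREATE"), ("moved", "MOVE"), ("destroyed", "DESTROY")):
        # Collect the 1-based sentence indices at which the event tag occurs.
        steps = [t for t, y in enumerate(tags, start=1) if y == tag]
        answers[event] = {"occurs": bool(steps), "when": steps}
    return answers


tags = ["NONE_BEFORE", "CREATE", "MOVE", "EXISTS", "MOVE", "DESTROY"]
print(sentence_level_answers(tags))
```

Cat-3 would additionally need the location predicted at each of the returned steps, which is why it is the only category that depends on the location tracking module.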
As shown in Table 2, our model using GloVe achieves state-of-the-art performance on the test set. The performance gain is attributed to the gains in Cat-1 and Cat-2 (7.60 and 4.57 points absolute over the best prior system in each category in Table 2), owing to the structural constraints imposed by the CRF layer. The gain in Cat-3 is relatively lower as it is the only sub-task involving location tracking. Additionally, using frozen ELMo embeddings further improves performance, with major improvements in Cat-1 and Cat-2.
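As a quick arithmetic check on the Task 1 numbers, the Macro-Avg column appears to be the unweighted mean of the three category scores; recomputing it from the reported values matches the table up to rounding. The helper name below is an assumption. The Micro-Avg column would need per-category question counts, which are not given here, so it is not reproduced.

```python
from typing import Sequence


def macro_average(category_scores: Sequence[float]) -> float:
    """Unweighted mean of the per-category scores."""
    return sum(category_scores) / len(category_scores)


# Cat-1, Cat-2, Cat-3 values copied from the results table.
entnet = macro_average([51.62, 18.83, 7.77])
ncet = macro_average([70.55, 44.57, 41.34])
print(round(entnet, 2), round(ncet, 2))  # 26.07 52.15
```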
3.2 Task 2: Document Level
The document-level evaluation tries to capture a more global context, where the templated questions concern the whole paragraph structure: (i) What are the inputs to the process? (ii) What are the outputs of the process? (iii) What conversions occur, when and where? (iv) What movements occur, when and where? (Footnote 3: Inputs refer to the entities which existed prior to the process and are destroyed during it. Outputs refer to the entities which get created in the process without subsequent destruction. Conversion refers to the simultaneous event which involves creation of some entities coupled with destruction of others.) Table 2 shows the performance of the model on this task. We achieve state-of-the-art results with an $F_1$ of 58.6 (62.5 with ELMo).
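The definitions of inputs and outputs can be turned into simple rules over per-entity tag sequences, and the reported F1 values can be checked against precision and recall. This is a simplified sketch: the tag names are hypothetical, and the official ProPara document-level scoring is more involved than the plain harmonic mean shown here.

```python
from typing import Dict, List, Tuple


def inputs_and_outputs(tag_sequences: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Classify entities as process inputs or outputs from their state tags."""
    inputs, outputs = [], []
    for entity, tags in tag_sequences.items():
        existed_at_start = tags[0] not in ("NONE_BEFORE", "CREATE")
        created = "CREATE" in tags
        destroyed = "DESTROY" in tags
        # Input: existed before the process and is destroyed during it.
        if existed_at_start and destroyed:
            inputs.append(entity)
        # Output: created during the process and never destroyed afterwards.
        if created and not destroyed:
            outputs.append(entity)
    return inputs, outputs


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


sequences = {
    "water": ["EXISTS", "MOVE", "DESTROY"],
    "sugar": ["NONE_BEFORE", "CREATE", "EXISTS"],
    "leaf": ["EXISTS", "EXISTS", "EXISTS"],
}
print(inputs_and_outputs(sequences))
# Precision and recall for NCET and NCET + ELMo, copied from the results table.
print(round(f1_score(64.2, 53.9), 1), round(f1_score(67.1, 58.5), 1))  # 58.6 62.5
```

Here "water" is an input, "sugar" is an output, and "leaf" is neither because it only exists throughout the process.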
3.3 Model Ablations
We now examine the performance of the model by comparing its variants along two different dimensions: (i) modifying the structural constraints for the CRF layer, and (ii) making changes to the continuous entity tracking.
Discrete Structural Constraints
We experiment with two alternative tag schemes, reported as Tag Set 1 and Tag Set 2 in Table 3. As shown in Table 3, the proposed 6 tag scheme outperforms the simpler tag schemes, indicating that the model is able to gain more from a better structural annotation. Additionally, we experiment with removing the transition features from our CRF layer, though we still use structural constraints. Taken together, these results show that carefully capturing the domain constraints in how entities change over time is an important factor in our model.
Continuous Entity Tracking
To evaluate the importance of different modules in our continuous entity tracking model, we experiment with (i) removing the verb information, and (ii) taking attention-based input for the entity tracking LSTM instead of the entity-mention information. This way instead of giving a hard attention by focusing exactly on the entity, we let the model learn soft attention across the tokens for each time-step. The model can now learn to look anywhere in a sentence for entity information, but is not given prior knowledge of how to do so. As shown, using attention-based input for entity tracking performs substantially worse, indicating the structural importance of passing the mask vector.
4 Conclusion
In this paper, we present a structured architecture for entity tracking which leverages both the discrete and continuous characterization of the entity evolution. We use a neural CRF approach to model our discrete constraints while tracking entities and locations recurrently. Our model achieves state of the art results on the ProPara dataset.
Acknowledgments
This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a Bloomberg Data Science Grant, and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Thanks to the anonymous reviewers for their helpful comments.
References
- Bosselut et al. (2018). Antoine Bosselut, Corin Ennis, Omer Levy, Ari Holtzman, Dieter Fox, and Yejin Choi. 2018. Simulating Action Dynamics with Neural Process Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
- Clark et al. (2018). Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.
- Collobert et al. (2011). Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- Dalvi et al. (2018). Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. 2018. Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.
- Das et al. (2019). Rajarshi Das, Tsendsuren Munkhdalai, Xingdi Yuan, Adam Trischler, and Andrew McCallum. 2019. Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension. In Proceedings of the International Conference on Learning Representations (ICLR).
- Durrett and Klein (2015). Greg Durrett and Dan Klein. 2015. Neural CRF Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) and the 7th International Joint Conference on Natural Language Processing.
- Henaff et al. (2017). Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the World State with Recurrent Entity Networks. In Proceedings of the International Conference on Learning Representations (ICLR).
- Ji et al. (2017). Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
