Extracting Victim Counts from Text

Mian Zhong; Shehzaad Dhuliawala; Niklas Stoehr

arXiv:2302.12367·cs.CL·February 27, 2023


TL;DR

This paper presents a novel approach to extracting victim counts from textual reports during crises by framing it as a question answering task, comparing various models, and analyzing their robustness and reliability.

Contribution

It introduces a QA-based framework for victim count extraction, evaluates multiple models including large language models, and provides practical recommendations for deployment in humanitarian contexts.

Findings

01 QA framing improves extraction accuracy

02 Advanced models outperform regex and dependency parsing

03 Models show varying robustness in out-of-distribution scenarios


Tables

Table 1: Exact-Match and $F_{1}$ scores of the baseline models and the fine-tuned NT5-Gen on injury counts. The best results are bolded. The NT5-Gen model performs better than baselines across all datasets. DEP refers to the dependency parsing model and SRL refers to the semantic role labeling model.

	Exact-Match			$𝐅_{1}$
	WAD	NAVCO	EMM	WAD	NAVCO	EMM
Regex	0.117	0.264	0.064	0.202	0.318	0.124
Dep	0.226	0.303	0.052	0.355	0.363	0.136
SRL	0.741	0.430	0.313	0.779	0.484	0.361
NT5-Gen	0.813	0.501	0.443	0.846	0.544	0.492

Table 2: Examples of SRL errors on death counts that the NT5-Gen model extracts correctly. Diverse Expression refers to string patterns not captured by pre-defined rules. Numerical Reasoning shows that the correct count has to be obtained by some mathematical operation over the text. Number Ambiguity indicates that a verbatim number is not written but an estimate may be made (with domain expertise). Number Spelling refers to problems with the number or text format, such as typos or cases the tokenizer parses wrongly (e.g., “twenty-three” $\rightarrow$ “twenty”).

Error Type	Context	Truth	SRL	NT5
Diverse Expression	Six passengers in a taxi also had their throats cut	6	0	6
Numerical Reasoning	Herders shot and killed four people […]. Herders then shot and killed a farmer at Jokhana […]	5	4	5
Number Ambiguity	Unidentified gunmen clash with army	1	0	1
Number Spelling	.Twenty-three people were killed […]	23	1	23

Table 3: Classification results on NAVCO injury data with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	$𝐅_{1}$	Precision	Recall
NT5	0.65	0.60	0.62	0.59
T5	0.65	0.60	0.61	0.59
BERT	0.52	0.23	0.17	0.33

Table 4: Calibration errors of fine-tuned NT5-Clf, NT5-Reg, and NT5-Gen models before (Orig.) and after (Calib.) applying post-hoc calibration. Post-hoc calibration effectively reduces the errors.

		Death		Injury
Data	Model	Orig	Calib.	Orig	Calib.
NAVCO	Clf	0.222	0.044	0.332	0.060
	Reg	0.220	0.097	0.141	0.057
	Gen	0.054	0.040	0.092	0.092
WAD	Clf	0.192	0.055	0.228	0.088
	Reg	0.272	0.107	0.167	0.294
	Gen	0.218	0.221	0.096	0.042
EMM	Clf	0.277	0.098	0.314	0.055
	Reg	0.201	0.189	0.368	0.188
	Gen	0.087	0.092	0.328	0.122

Table 5: Overview of pros and cons of different models. We list baselines: regular expressions (REGEX), dependency parsing (DEP), and semantic role labeling (SRL). CLF, REG, and GEN refer to the fine-tuned NT5-Clf, NT5-Reg, and NT5-Gen models. Absolute / Relative Error pertains to the absolute/relative error between true victim counts and model predictions, taking the real numerical value of the counts (e.g., mean squared error). String Match considers string metrics like Exact-Match used in question answering. The Reliability column is based on experiments in model calibration. Robustness is divided into the need for training on a large annotated dataset and the stability in out-of-distribution (OOD) settings. N/A means “Not Applicable”.

	Accuracy Optimization			Reliability	Robustness		Hardware
	Absolute Error	Relative Error	String Match		Need Training	Stable in OOD
REGEX	High	Medium	Medium	N/A	No	N/A	Low
DEP	Medium	High	Low-Medium	N/A	No	N/A	Low
SRL	Low-Medium	Medium	High	N/A	No	N/A	Low-Medium
CLF	N/A	N/A	N/A	Low	Medium-High	Low	Medium - High
REG	Low	Low	N/A	Low	High	Low-Medium	Medium - High
GEN	Low	Low	Medium-High	High	Low-Medium	Medium-High	High

Table 6: Regex patterns.

Data Type	Regex Type	Regex Pattern
Death	Passive Plural	\d(\d|,)(?!\D(injur|wound))(?=.(\b(were|are)\D\b(killed|dead|died|slain)))
	Passive Singular	\S(?!\D(injur|wound))(?=.(\b(was|is)\D\b(killed|dead|died|slain)))
	Active	(kill|slay|slain)\D\b\d(\d|,)
Injury	Passive Plural	\d(\d|,)(?!.(\b(were|are)?\D\b(killed|dead|died|slain)))(?=.\b(injur|wound))
	Passive Singular	\S(?=(was|is).\b(injur|wound))(?!\D(\b(were|are)\D\b(killed|dead|died|slain)))
	Active	(injured?|wound)\D*\d+

Table 7: Exact-Match and $F_{1}$ scores of the baseline models and the fine-tuned NT5-Gen model on death counts. Best metrics are bolded. DEP refers to the dependency parsing model and SRL refers to the semantic role labeling model.

	Exact Match			$F_{1}$
	WAD	NAVCO	EMM	WAD	NAVCO	EMM
Regex	0.3543	0.3921	0.2835	0.3897	0.4196	0.3242
Dep	0.1506	0.3526	0.0767	0.2064	0.3792	0.1317
SRL	0.4342	0.4839	0.3972	0.7794	0.4837	0.3613
NT5	0.6798	0.6590	0.6322	0.8458	0.5436	0.4917

Table 8: Classification results on WAD death counts with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	F1 score	Precision	Recall
NT5	0.81	0.81	0.80	0.83
T5	0.81	0.81	0.81	0.84
BERT	0.86	0.86	0.86	0.88

Table 9: Classification results on WAD injury counts with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	F1 score	Precision	Recall
NT5	0.77	0.69	0.70	0.69
T5	0.76	0.69	0.70	0.68
BERT	0.93	0.91	0.91	0.90

Table 10: Classification results on NAVCO death counts with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	F1 score	Precision	Recall
NT5	0.65	0.60	0.62	0.59
T5	0.65	0.60	0.61	0.59
BERT	0.52	0.23	0.17	0.33

Table 11: Classification results on EMM death counts with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	F1 score	Precision	Recall
NT5	0.72	0.65	0.66	0.65
T5	0.70	0.63	0.65	0.63
BERT	0.84	0.80	0.82	0.78

Table 12: Classification results on EMM injury counts with the NT5-Clf model initialized by different pre-trained weights: nt5, t5-small, and bert-base-uncased. $F_{1}$, precision, and recall scores are macro-averaged.

	Accuracy	F1 score	Precision	Recall
NT5	0.68	0.58	0.60	0.57
T5	0.68	0.58	0.59	0.57
BERT	0.81	0.77	0.79	0.76

Equations

\mathrm{ECE}=\sum_{m=1}^{M}\frac{\lvert B_{m}\rvert}{n}\bigg\lvert\operatorname{acc}(B_{m})-\operatorname{conf}(B_{m})\bigg\rvert.

\mathrm{RegCE}=\frac{1}{M}\sum_{m=1}^{M}\bigg\lvert\operatorname{freq}(B_{m})-\sup(B_{m})\bigg\rvert.


Code & Models

Repositories

mianzg/victim_counts (PyTorch, official)


Full text

Extracting Victim Counts from Text

Mian Zhong Shehzaad Dhuliawala Niklas Stoehr

Institute for Machine Learning, ETH Zürich


Abstract

Decision-makers in the humanitarian sector rely on timely and exact information during crisis events. Knowing how many civilians were injured during an earthquake is vital for allocating aid properly. Information about such victim counts is often only available within full-text event descriptions from newspapers and other reports. Extracting numbers from text is challenging: numbers have different formats and may require numeric reasoning. This renders purely string matching-based approaches insufficient. As a consequence, fine-grained counts of injured, displaced, or abused victims beyond fatalities are often not extracted and remain unseen. We cast victim count extraction as a question answering (QA) task with a regression or classification objective. We compare regex, dependency parsing, semantic role labeling-based approaches, and advanced text-to-text models. Beyond model accuracy, we analyze extraction reliability and robustness, which are key for this sensitive task. In particular, we discuss model calibration and investigate few-shot and out-of-distribution performance. Ultimately, we make a comprehensive recommendation on which model to select for different desiderata and data domains. Our work is among the first to apply numeracy-focused large language models in a real-world use case with a positive impact. Code is available online at:

https://github.com/mianzg/victim_counts

1 Introduction

Timely and accurate information during crisis events is crucial for rescue operations and the allocation of humanitarian aid Lepuschitz and Stoehr (2021). However, crisis information is often scarce, subjective, or biased, which renders reported numbers in text extremely important Hellmeier et al. (2018); Zavarella et al. (2020); Radford (2021). For instance, the count of injured or missing people provides quantitative information about the catastrophic impact of an earthquake. In this work, we focus on human victims in crisis events, e.g., fatalities in floods, herein referred to as victim counts. A reliable estimate of victim counts is helpful during crisis (Darcy and Hofmann, 2003; Kreutzer et al., 2020), and also post-crisis, benefiting research to diversify measures of crisis intensity. As of now, most intensity measures are either limited to event types Vincent (1979); Goldstein (1992), fatality counts Kalyvas (2006); Chaudoin et al. (2017) or both Stoehr et al. (2022). More fine-grained measures such as injured, displaced, or abused victims are not captured in most popular databases and remain unmonitored (Krause, 2013; Cruyff et al., 2017; Cullen et al., 2021).

Many victim counts are reported in full-text form within event descriptions in news media. This makes their systematic collection and analysis technically complex. Manual extraction of victim counts from text is very labor-intensive and does not scale to big data collections (Schrodt and Ulfelder, 2016; Lewis et al., 2016). Computerized approaches such as the event coding software Tabari Schrodt (2009) and Petrarch2 Norris et al. (2017) focus on extracting actor and event types. They rely on lambda calculus and syntactic pattern matching, but disregard mentions of victim counts.

As we will show, parsing-based approaches perform reasonably well at extracting explicitly reported victim counts. They can identify the mention of the count “$5$” in “5 people were injured”. However, they are often inadequate when the description only implies the correct count. For example, from the description that “one logger was shot but survived”, a human reader may infer that one person is injured. Since neither a count nor the injury is mentioned explicitly, a parsing-based system may fall short. Another difficulty stems from the fact that counts can be reported in many different formats. A reported count may be digit-based or spelled out, and it may define an exact quantity or a range, as in “dozens of people were injured”. As a consequence, formulating the task of victim count extraction is not an easy endeavor (§ 3). Most prior work assumes a setting where the count is explicitly mentioned in an event description Döhling and Leser (2011); Imran et al. (2013); Rudra et al. (2018); Camilleri et al. (2019). Such settings can be tackled by sequence labeling models that select a relevant span from the given description. However, if the victim count does not appear verbatim, as in the above “one logger” example, models with some form of abstract reasoning capacity may be needed Roy et al. (2015). Recently, large language models have shown promising results in answering number-focused questions with and without explicit mentions of relevant numbers Lewkowycz et al. (2022); Nye et al. (2021); Wei et al. (2022); Lefebvre and Stoehr (2022).

This paper is concerned with studying these different approaches (§ 4): as baselines, we compare regular expressions, dependency parsing, and semantic role labeling. We consider the NT5 Yang et al. (2021) model as a representative numeracy-enhanced pre-trained language model. We use the representation of this model in a generation, a classification, and a regression setting. We evaluate all models along three dimensions: accuracy (§ 5), reliability (§ 6), and robustness (§ 7). We find that the fine-tuned language model outperforms the baseline models, especially when the victim count extraction requires reasoning. Reliability and robustness are particularly important in high-stakes, human-centric tasks such as victim count extraction Zhang et al. (2020); Kong et al. (2020); Russo et al. (2022b). Model reliability indicates to what extent model behavior can be trusted within decision-making settings Leibig et al. (2017); Jiang et al. (2021). One dimension of reliability is model calibration, which indicates whether a model’s confidence is well aligned with the correctness of its predictions Guo et al. (2017). While calibration has been widely studied for classification, we add to the discussion of calibrated regression Song et al. (2019) and generation settings Widmann et al. (2021). Finally, the dimension of robustness describes how stably a model performs. For instance, when the training set is limited or when the test data is out-of-distribution, a less robust model will forfeit more of its predictive performance. To shed light on this dimension, we conduct experiments in few-shot learning and out-of-distribution settings.

We conclude with an application to showcase the extraction of fine-grained and highly specialized types of victim counts. Lastly, we discuss the benefits and drawbacks of the different approaches to assist practitioners in choosing the most suitable task formulation and model.

2 Data

We use publicly available datasets covering natural disasters and armed conflicts, namely: (1) World Atrocities Dataset (WAD) (Schrodt and Ulfelder, 2016), (2) Non-violent and Violent Campaigns and Outcomes 3.0 (NAVCO) (Lewis et al., 2016), and (3) European Media Monitor (EMM) (JRC Science Hub, 2018; Steinberger et al., 2017). For each dataset, we use the event text description and two types of victim counts: the death count and the injury count, which we refer to as, e.g., “WAD death” or “WAD injury”. We pre-process the data by removing the samples with missing values (NaN) in the victim counts. For EMM, we only consider samples with a non-zero victim count since “$0$” is over-represented.

3 Task Formulation

In this section, we discuss some questions and challenges faced in formulating the task of extracting victim counts from event descriptions. We justify some of the choices we make and describe why it is not possible to have a single formulation that fits all needs:

Is the victim count always present in the text?

Victim counts can be expressed in various ways in the text. When the count is expressed explicitly in the text, say “5 people were injured”, a span extraction model can effectively extract the injury count $5$. However, in certain cases, a single explicit number might not be mentioned, and the victim count needs to be logically or algebraically inferred from the text. Consider the description “a 4-year-old girl and her mother were found dead”; a model would need to logically deduce that the victim count of death is $2$. To handle this, we not only look at span extraction models but also experiment with models that can understand the text at a deeper level and produce a victim count.

Is the victim count always a single number?

Often, in the event description, the victim count is described as a range, such as “at least 330 people died”, or in vague terms, like “dozens were injured”. Additionally, even within a single description, the victim counts for the same event can vary, possibly because the counts were recorded from different sources. This makes extracting a single exact count almost impossible. In such cases, the best a model can do is to output a close estimate of the actual victim count. Another solution would be to provide a range within which the count could lie. For a humanitarian organization deciding on the quantity of aid to deploy, a range might suffice in place of a single exact count. To account for this, we also look at models that are trained to output a range by classifying the victim counts into a set of binned categories.

4 Models

In § 4.1, we introduce baseline models that parse an event description and heuristically extract a victim count. We then specify the model implementation for the different task formulations in § 4.2.

4.1 Baseline Models

All baselines extract a victim count by locating the part of the text that could be relevant to victims and finding the nearby victim counts. The locating step requires a pre-defined list of words denoted as locating list. For example, to extract death counts, this list would include terms like “kill” and “die”.

Regex.

Regular expressions (regex) are a rule-based method to extract counts by string pattern matching. The patterns (App. A) are built based on active or passive voice to extract the count closest to phrases in the locating list.
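As a rough illustration of the regex baseline, the following Python sketch extracts a digit-based death count from passive and active constructions. The locating words, both patterns, and the helper name `regex_death_count` are simplified assumptions made for this example; they are not the patterns listed in App. A.

```python
import re

# Hypothetical locating list and patterns for death counts (illustration only).
DEATH_WORDS = r"(?:killed|dead|died|slain)"
# Passive voice: "<number> ... were/are/was/is killed"
PASSIVE = re.compile(
    rf"(\d[\d,]*)\s+\D*?\b(?:were|are|was|is)\s+{DEATH_WORDS}", re.I
)
# Active voice: "killed ... <number>"
ACTIVE = re.compile(r"\b(?:kill\w*|slay\w*|slew)\b\D*?(\d[\d,]*)", re.I)


def regex_death_count(text: str) -> int:
    """Return the first digit-based death count found, else 0."""
    for pattern in (PASSIVE, ACTIVE):
        match = pattern.search(text)
        if match:
            # Strip thousands separators before converting.
            return int(match.group(1).replace(",", ""))
    return 0
```

Under this sketch, a spelled-out count such as “Twenty-three people were killed” yields 0, which mirrors the limits of pure string matching on non-digit formats.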

Dependency Parsing.

The dependency parsing model collects all possible numeric modifiers and their dependency relationships. Since not every numeric modifier relates to victim counts, e.g., “42-year-old”, we construct dependency rules with the locating list to decide if the number is the victim count. For example, one rule checks whether the numeric modifier belongs to a subject phrase, which rejects “$42$” in the example of “42-year-old”. If no numeric modifier is found (e.g., “a journalist was injured”), additional rules use the locating list to return “$1$” if the rule is satisfied and otherwise return “$0$”.

SRL.

Semantic role labeling (SRL) recursively decomposes text input into pairs of predicates and their arguments. We define a list of predicate verbs for death and injury counts as the locating list. Then, we iterate over the predicate-argument pairs, check if any predicate from the locating list occurs, and extract the count from its argument if possible. If a predicate exists, the implementation returns the first number as the count if multiple are found and returns “$1$” if no verbatim number is found. If no such predicate appears, the count is set to “$0$”.

4.2 Task Modeling

We perform victim count extraction using three methods: generation, regression, and classification. As discussed above, each of these approaches caters to the different formulations of our task and can be beneficial in different scenarios. Across these methods, we use the same underlying NT5 model. For clarity, we denote NT5-Gen, NT5-Reg, and NT5-Clf for the corresponding models. The NT5 model (Yang et al., 2021) is a variant of the T5 model Raffel et al. (2020) with further fine-tuning on numerical tasks. We query the model in a similar fashion to previous works by giving the question and event description in the form: “answer me:[question] context:[passage]”. We discuss how we fine-tune this model for each of our specific methods below.
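The query format above can be assembled with a one-line helper. This is a minimal sketch: the function name `build_query` and the example question wording are assumptions, and the exact spacing around the two prefixes is taken literally from the template quoted in the text.

```python
def build_query(question: str, passage: str) -> str:
    """Format a question and an event description as a single model input."""
    return f"answer me:{question} context:{passage}"


# Hypothetical question wording for the injury count.
example = build_query(
    "How many people were injured?",
    "Five people were injured when a bus overturned.",
)
```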

Generation.

For generation, we fine-tune NT5 to decode the victim counts autoregressively. At inference, we use beam search to generate output. Unconstrained generation is not guaranteed to produce only numeral tokens; therefore, we follow De Cao et al. (2021) to constrain the possible generation tokens in a prefix-conditioned way, such that only the digit tokens $0$ to $9$ and the EOS token are allowed at each decoding step.
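The effect of the digit constraint can be sketched with a toy decoder. The example below is an assumption-laden simplification: it uses greedy decoding in place of beam search, a hand-written scoring function (`toy_scores`) in place of a language model, and a string-valued vocabulary. In the Hugging Face `transformers` library, the `prefix_allowed_tokens_fn` argument of `generate` could play a comparable role with a real model.

```python
from typing import Callable, Dict, List

EOS = "</s>"
# Only digit tokens and EOS may be emitted at any decoding step.
ALLOWED = set("0123456789") | {EOS}


def constrained_greedy_decode(
    step_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 6,
) -> str:
    """Greedy decoding restricted to digit tokens and EOS."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = step_scores(prefix)
        # Discard every candidate token outside the allowed set.
        candidates = {tok: s for tok, s in scores.items() if tok in ALLOWED}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        if best == EOS:
            break
        prefix.append(best)
    return "".join(prefix)


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical scores; unconstrained decoding would pick 'twenty'."""
    if not prefix:
        return {"twenty": 2.0, "2": 1.5, "3": 0.1, EOS: -1.0}
    if prefix == ["2"]:
        return {"-": 2.0, "three": 1.0, "3": 0.9, EOS: 0.2}
    return {EOS: 1.0, "0": 0.0}
```

With these toy scores, the constrained decoder emits “23” even though a word token has the highest raw score at the first two steps.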

Regression.

For regression, we add two linear layers (with ReLU activation) on the encoder representation to output the numerical victim count. The model is trained to optimize the $\log$ mean-squared error between the true and predicted count.
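One plausible reading of the log mean-squared error is the mean squared difference between log-transformed counts. The sketch below assumes a $\log(1+x)$ transform so that zero counts are well defined, and it clips negative predictions at zero; both choices, as well as the function name `log_mse`, are assumptions rather than details stated in the text.

```python
import math
from typing import Sequence


def log_mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between log(1 + count) values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        # Clip predictions at zero so that log1p stays defined (assumption).
        (math.log1p(t) - math.log1p(max(p, 0.0))) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return sum(squared) / len(squared)
```

Under such a loss, predicting 90 for a true count of 100 is penalized far less than predicting 0 for a true count of 10, which suits counts that span several orders of magnitude.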

Classification.

We model the task as a classification problem by binning the victim counts into ordinal classes. Similar to regression, the model has a classification head of a linear layer and a softmax layer on top of an encoder initialized with NT5 weights. Our experiments use a 3-class classification by converting the victim counts into three categories: $[0,3]$, $(3,10]$, and $(10,\infty)$.
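The three-way binning can be written directly from the interval definitions. In this small sketch, the function name `count_to_class` and the class indices 0, 1, 2 are illustrative choices.

```python
def count_to_class(count: int) -> int:
    """Map a victim count to one of the classes [0,3], (3,10], (10,inf)."""
    if count < 0:
        raise ValueError("victim counts cannot be negative")
    if count <= 3:
        return 0  # [0, 3]
    if count <= 10:
        return 1  # (3, 10]
    return 2  # (10, inf)
```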

5 Accuracy of Counts Extraction

We begin by evaluating the efficacy of our proposed methods for victim count extraction. We examine the model accuracy by comparing baselines and the fine-tuned model with a generation objective (§ 5.1). We then show the results of using classification and regression formulations (§ 5.2).

5.1 Comparing Baselines with NT5-Gen

We compare the accuracy of the baseline models and the fine-tuned NT5-Gen model. Tab. 1 shows the results of extracting the injury counts using the Exact-Match and $F_{1}$ scores commonly used in related tasks Yang et al. (2021); Dua et al. (2019). We measure the $F_{1}$ score on digitized tokens (i.e., “$34$” $\rightarrow$ [“$3$”, “$4$”]). The fine-tuned NT5-Gen model improves on the strongest baseline, SRL, by $7$ to $13$ percentage points in Exact-Match and by $6$ to $13$ percentage points in $F_{1}$ score. The performance of regex and dependency parsing varies heavily across datasets, which suggests that fixed regex patterns and dependency rules are less helpful in finding the victim counts.
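For a single prediction, the two metrics can be sketched as follows. The $F_{1}$ computation here assumes a bag-of-tokens overlap over digits, as is common in question answering evaluation; the exact scoring script used for the reported numbers may differ, and the function names are illustrative.

```python
from collections import Counter


def exact_match(pred: str, truth: str) -> float:
    """1.0 if the predicted string equals the ground truth, else 0.0."""
    return float(pred.strip() == truth.strip())


def digit_f1(pred: str, truth: str) -> float:
    """F1 over digit tokens, e.g. '34' is split into ['3', '4']."""
    pred_digits = Counter(pred.strip())
    true_digits = Counter(truth.strip())
    # Multiset intersection counts digits shared by both strings.
    overlap = sum((pred_digits & true_digits).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_digits.values())
    recall = overlap / sum(true_digits.values())
    return 2 * precision * recall / (precision + recall)
```

A prediction of “34” against a truth of “35” scores 0 in Exact-Match but 0.5 in this digit-level $F_{1}$, which is why the two columns in the tables can diverge.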

Moreover, we convert the victim counts into four bins, where the bins are selected to have a balanced number of samples in each bin. As an illustration, Fig. 1 shows the confusion matrices on the transformed injury counts. For both victim types, baseline models have low precision because they falsely return “$0$” too often. Compared with the baselines, the NT5-Gen model is better at extracting victim counts whose numeric values are large (e.g., $y>10$).

Qualitative Analysis.

We qualitatively examine error samples of the SRL model that the NT5-Gen model extracts correctly. We randomly select 20 error samples for each test set to evaluate and summarize 4 types of errors with examples in Tab. 2. Out of all errors, $39.2\%$ belong to diverse linguistic expressions for depicting victims, $38.3\%$ contain number ambiguity, $8.3\%$ need numerical reasoning, and $5.8\%$ have spelling issues (for the tokenizer). (In a few samples, the ground truth might be erroneous. As event coding requires more domain expertise within the corresponding social science discipline, we leave this discussion out of this work.) The NT5-Gen model performs better when the count needs numerical reasoning. Even if reasoning is not needed, SRL may fail when the linguistic expression used to depict victims (e.g., “have throats cut”) is outside the pre-defined locating list (e.g., [“die”, “kill”, “slay”]). These error types are difficult to fix in the baseline models since the patterns cannot be defined beforehand.

5.2 Results on Classification and Regression

We examine the accuracy of the classification and regression formulations by comparing NT5-Clf and NT5-Reg with different initialization weights. To compare, we use t5-small and bert-base-uncased pre-trained weights for the encoder. Tab. 3 shows the classification results on NAVCO injury data. Fine-tuning t5-small and nt5 reaches comparable performance; precision and recall scores are similar, but precision is slightly higher. The scatter plots (Fig. 2) show the results of regression using different pre-trained weights with the mean squared error (MSE). For ($\log$-transformed) victim counts larger than $5$, the regression objective seems conservative and tends to give small-valued predictions. The numeracy-rich NT5 weights do not particularly improve accuracy for a classification or regression objective, and standard pre-trained weights might be sufficient.

6 Evaluating Reliability

Another important dimension is reliability which we evaluate through the lens of calibration (§ 6.1). As we approach the task with multiple formulations, calibration analysis is especially needed to understand whether a model is calibrated (§ 6.2), and how post-hoc calibration techniques may adjust models to be better calibrated (§ 6.3).

6.1 Preliminaries: Calibration Metrics

A well-calibrated model ensures that the confidence of the output is well aligned with the chance of the output being accurate. This is a desirable property for our task: consider a model that extracts “0” when the text depicts an injured person. A calibrated model would assign very low confidence to the extracted count, which may avoid error propagation to downstream decisions, e.g., medical resource dispatch. We here introduce the expected calibration error ($\mathrm{ECE}$) (Pakdaman Naeini et al., 2015), a standard metric for classification that has been extended to generation decoding Widmann et al. (2021). For regression, we apply the quantile calibration error (Kuleshov et al., 2018).

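Given $n$ samples, we create $M$ equal-width bins over the interval $[0,1]$. ECE takes a weighted average of the differences between the classification accuracy and the mean confidence within each bin $B_{m}$,

$$\mathrm{ECE}=\sum_{m=1}^{M}\frac{\lvert B_{m}\rvert}{n}\bigg\lvert\operatorname{acc}(B_{m})-\operatorname{conf}(B_{m})\bigg\rvert.$$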
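The ECE definition translates into a few lines of Python. The sketch below is a plain-Python illustration under stated assumptions: the function name and the default of ten bins are choices made here, and bin membership uses left-closed intervals, with a confidence of exactly 1.0 assigned to the last bin.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    size = [0] * num_bins
    for c, ok in zip(confidences, correct):
        # Bin index for equal-width bins over [0, 1].
        m = min(int(c * num_bins), num_bins - 1)
        conf_sum[m] += c
        acc_sum[m] += float(ok)
        size[m] += 1
    ece = 0.0
    for m in range(num_bins):
        if size[m] == 0:
            continue  # empty bins contribute nothing
        acc = acc_sum[m] / size[m]
        conf = conf_sum[m] / size[m]
        ece += size[m] / n * abs(acc - conf)
    return ece
```

For example, four predictions that each carry a confidence of 0.9 but are correct only three times out of four give an ECE of about 0.15.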

The quantile calibration error averages the differences between the empirical frequency $\operatorname{freq}(B_{m})$ and the upper bound of $B_{m}$ (i.e., $\sup(B_{m})$), where $\operatorname{freq}(B_{m})$ is the fraction of the $n$ samples whose quantiles are lower than or equal to $\sup(B_{m})$,

$$\mathrm{RegCE}=\frac{1}{M}\sum_{m=1}^{M}\bigg\lvert\operatorname{freq}(B_{m})-\sup(B_{m})\bigg\rvert.$$
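A corresponding sketch for the quantile calibration error is given below. It assumes that each sample already comes with the predicted quantile of its true value (that is, the predictive CDF evaluated at the ground truth); how a regression model produces these quantiles is not covered here, and the function name is illustrative.

```python
from typing import Sequence


def quantile_calibration_error(
    quantiles: Sequence[float], num_bins: int = 10
) -> float:
    """Average of |freq(B_m) - sup(B_m)| over M equal-width bins on [0, 1]."""
    n = len(quantiles)
    if n == 0:
        raise ValueError("at least one sample is required")
    total = 0.0
    for m in range(1, num_bins + 1):
        upper = m / num_bins  # sup(B_m)
        # Fraction of samples whose quantile is at most sup(B_m).
        freq = sum(q <= upper for q in quantiles) / n
        total += abs(freq - upper)
    return total / num_bins
```

If the quantiles are spread uniformly over $[0,1]$, the empirical frequencies match the bin bounds and the error is zero; quantiles that pile up at one end increase it.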

The calibration error of generation decoding takes the best $b$ beam search answers, and applies softmax on their scores to represent the confidence. The ECE is then calculated on the best beam search answer similar to classification.
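The confidence of a generated answer can be sketched as a softmax over the scores of the $b$ returned beams. The helper name `beam_confidences` and the example scores are hypothetical; the first entry of the result would serve as the confidence of the best answer in the ECE computation.

```python
import math
from typing import List, Sequence


def beam_confidences(beam_scores: Sequence[float]) -> List[float]:
    """Softmax over beam scores (e.g. sequence log-probabilities)."""
    top = max(beam_scores)
    # Subtract the maximum for numerical stability.
    exps = [math.exp(s - top) for s in beam_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for b = 3 beams, best answer first.
probs = beam_confidences([-0.1, -2.3, -3.0])
best_confidence = probs[0]
```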

6.2 Calibration Error on Different Models

We show in Tab. 4 the calibration errors measured on the fine-tuned NT5-Clf, NT5-Reg, and NT5-Gen with different data. Surprisingly, the NT5-Gen model is well-calibrated on most datasets, except for EMM injury: the lowest calibration error is $0.05$ on NAVCO death, and the errors on other data range between $0.08$ and $0.33$. Classification models tend to have large calibration errors ($>0.19$). In particular, the error is larger than $0.3$ on NAVCO and EMM data to classify injury counts. Regression is also prone to large calibration errors ($>0.15$).

Reliability diagrams are another helpful tool; they visualize the calibration errors at different confidence bins. As an illustration, Fig. 3 shows the diagram of the NT5-Clf model fine-tuned on NAVCO injury data, where the diagonal line indicates perfect calibration. This model is over-confident, and we can observe large gaps when the model confidence is larger than $0.8$.

6.3 Post-hoc Calibration

Since the models can be over-confident based on the above analysis, we see the necessity to calibrate models for victim count extraction. We use temperature scaling for classification and generation decoding, and isotonic regression for regression. The post-hoc calibrators use development data to minimize negative log-likelihood and are then applied to test sets to measure calibration errors. As a comparison, Fig. 3 (right) shows the calibrated results of the fine-tuned NT5-Clf model on NAVCO injury data. The calibration error (i.e., ECE) reduces from $0.33$ to $0.06$. The errors of other calibrated models can be found in Tab. 4. In general, when the models have a rather large calibration error (e.g., $>0.3$), post-hoc calibration is more helpful and adjusts the models to a better-calibrated level.
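Temperature scaling can be sketched as a one-parameter search on development data. The example below is a simplification under stated assumptions: it picks the temperature from a fixed grid instead of using gradient-based optimization, works on raw classification logits, and uses hypothetical function names and toy data.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_batch, labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true labels."""
    losses = [
        # Floor the probability to avoid log(0) on extreme logits.
        -math.log(max(softmax(logits, temperature)[y], 1e-12))
        for logits, y in zip(logits_batch, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(
    logits_batch, labels: Sequence[int], grid: Optional[Sequence[float]] = None
) -> float:
    """Pick the temperature that minimizes NLL on development data."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))


# Toy development set: confident logits, but only 3 of 4 predictions correct.
dev_logits = [[4.0, 0.0, 0.0]] * 4
dev_labels = [0, 0, 0, 1]
temperature = fit_temperature(dev_logits, dev_labels)
```

On this toy set the fitted temperature is above 1, which softens the over-confident probabilities; the same temperature would then be applied unchanged to test-set logits.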

7 Evaluating Robustness

Typically, conflict or disaster data is noisy and limited. This makes it challenging to train models on a large-scale, high-quality training set. For this reason, we need robust models that excel in few-shot and out-of-distribution settings.

Reduced Training Size.

We fine-tune the NT5-Gen, NT5-Reg, and NT5-Clf models on different-size portions of the training set. Specifically, we use $100\%$, $50\%$, $10\%$, $5\%$, $0.5\%$, and $0\%$ of the training data, as further discussed in § C.1. As expected, we find that the accuracy of all models drops when using a smaller training set. The NT5-Gen model proves to be the most robust, keeping the Exact-Match metric above $0.6$ when fine-tuned on only $5\%$ of the training data. The calibration error of the fine-tuned NT5-Clf model increases when the training size is reduced, while the fine-tuned NT5-Reg and NT5-Gen models do not follow this trend. In the zero-shot setting, the NT5-Reg and NT5-Gen models reach their largest calibration error. In contrast, the NT5-Clf model reaches its smallest calibration error in the zero-shot setting.

Out-of-distribution (OOD) Setting.

We set up synthetic tasks in which a fine-tuned model is confronted with an out-of-distribution setting at test time. For example, we fine-tune a model on WAD death and then repurpose it to classify WAD injury. Then, we evaluate the drop in performance of this “out-of-distribution” model compared to an “in-distribution” model, that has been trained on WAD injury labels directly. We conduct this comparison on different datasets and models.

In § C.2, we evaluate the NT5-Clf model in a classification formulation and report accuracy. As expected, we find that accuracy decreases in every setting with performance drops between $0.001\%$ and $0.3\%$. In Fig. 15, we evaluate the NT5-Reg model in a regression setting measured in MSE. We find that the performance decreases in the out-of-distribution settings as evidenced by an average increase of $1.12$ in MSE. Finally, in Fig. 16, we turn to an NT5-Gen model in a generative setting. As an evaluation metric, we consider Exact-Match and observe a decrease of $0.18$ in Exact-Match on average.

8 Application: Overlooked Victim Types

Most event datasets feature only one column detailing victim counts. This column typically quantifies fatalities, as they are considered least ambiguous and most important Kalyvas (2006); Chaudoin et al. (2017). The Armed Conflict Location & Event Data Project (ACLED) Raleigh et al. (2010); Raleigh and Kishi (2019) recently published curated datasets containing violence against healthcare workers, media personnel, and women. Considering the ACLED dataset on Political Violence Targeting Women & Demonstrations Featuring Women, we find that more than $85\%$ of events have zero fatalities. This means that many forms of violence remain non-quantified, often those against “marginalized” groups of society.

Using the methods presented in this work, we can extract much more fine-grained victim types such as “injured women” and “abducted women”. To this end, we rely on the NT5-Gen model that we fine-tuned on the NAVCO data, without specifically asking for “women”. In Fig. 4, we present exemplary two-month time series of events in Syria. We find that our model has a higher recall than precision on the ground truth annotations for fatality counts. This may be desirable since we would like to avoid overlooking true victim counts.

9 Discussion

This work surveys different task formulations of victim count extraction and inspects desiderata like accuracy, reliability, and robustness of different models. We now summarize our findings and conclude which approach performs best under which circumstances (Tab. 5).

Some of the parsing-based approaches have the advantage of requiring no ground truth annotations of the extracted victim counts. This means that there is no need for training; instead, a manually curated list of patterns and rules has to be assembled. The regex approach, for instance, has minimal hardware requirements, but writing regex patterns is very time-intensive and prone to mistakes. Overall, the baseline models shine when it comes to speed, and they perform reasonably well when victim counts are explicitly mentioned. Yet they fail at complex reasoning. For instance, when asked for the count of deaths in “one child and four women lost their lives”, all baselines mistakenly output “$1$”.

This is where language model-based methods have a competitive edge. The fine-tuned NT5-Gen model has high accuracy in both the Exact-Match metric and the relative error metric. Surprisingly, it is also well-calibrated and relatively robust in the few-shot and out-of-distribution settings. This performance comes at the cost of reduced speed, the requirement of large amounts of training data, and the need for resources like GPUs to be deployed on a large scale.

Comparing classification and regression objectives, we conclude that classification is easier to handle. In most settings, it may be sufficient to extract a range rather than an exact number anyway. Compared to generation, models in the classification and regression settings show higher calibration errors and require post-hoc calibration to adjust the model confidence.
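To make the idea of predicting a range rather than an exact number concrete, the following sketch maps counts to class indices. The bin edges and labels are hypothetical and chosen only for illustration; the bins actually used in the experiments are defined in the task formulation and may differ.

```python
from bisect import bisect_right

# Hypothetical bin boundaries (lower bounds of classes 1..4), assumed here.
BIN_EDGES = [1, 4, 11, 101]
BIN_LABELS = ["0", "1-3", "4-10", "11-100", ">100"]


def count_to_class(count):
    """Map an exact victim count to the index of the range it falls into."""
    return bisect_right(BIN_EDGES, count)


for count in (0, 3, 4, 100, 250):
    print(count, "->", BIN_LABELS[count_to_class(count)])
```

A classifier trained on such labels only needs to place a report in the right range, which is a coarser and typically easier target than the exact count.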

10 Related Works

This work interfaces with related works from different disciplines to improve the measurement of crisis intensity. It draws inspiration from recent advancements in question answering models with a focus on numbers and math word problems, including number-enhanced language models. Our work also connects with model calibration in natural language processing (NLP) more generally.

Measurement of Crisis Intensity.

Extracting information about crises has been widely explored using social media data Temnikova et al. (2015) and newspapers Keith et al. (2017); Halterman et al. (2021). Most existing measures of crisis intensity focus on counts of event types Goldstein (1992); Terechshenko (2020); Stoehr et al. (2022) or fatality counts Kalyvas (2006). Previous work studies friend-enemy relationships Han et al. (2019); Russo et al. (2022a); Stoehr et al. (2021, 2023) and conflict-indicative changes in word embeddings Kutuzov et al. (2017).

Numerical Question Answering.

Numerical Question Answering pertains to the task of providing numeric answers to questions. An exemplary model is NAQANet Dua et al. (2019), which extends QANet (Yu et al., 2018) with numerical operations. Neural Module Networks (Gupta et al., 2020) learn and execute a chain of logical learnable and differentiable modules. Some of these modules are specifically targeted at mathematical operations such as find-num or count. Other approaches leverage knowledge graphs Davidov and Rappoport (2010); Kotnis and García-Durán (2019) or graph neural networks Chen et al. (2020). Thawani et al. (2021) provide a detailed overview of methods for representing and modeling numbers in NLP.

Number-enhanced Language Models.

More recent work in number question answering relies on pre-trained large language models. GenBERT Geva et al. (2020) improves numeric reasoning abilities by including a large amount of synthetic data containing numbers. Codex Chen et al. (2021) and NT5 Yang et al. (2021) apply similar strategies and are trained on code and math word problems. Other approaches focus on step-by-step reasoning such as Minerva Lewkowycz et al. (2022), scratchpad Nye et al. (2021) and chain-of-thought prompting Wei et al. (2022). Lefebvre and Stoehr (2022) propose a prompting-based method particularly for conflict event classification.

Calibration of NLP Models.

The calibration of NLP models has been extensively studied in classification Guo et al. (2017) and structured prediction tasks (Kuleshov and Liang, 2015; Nguyen and O’Connor, 2015). Calibration methods have been adapted in language modeling (Braverman et al., 2020; Kong et al., 2020), question answering (Kamath et al., 2020; Jiang et al., 2021), and machine translation (Kumar and Sarawagi, 2019; Wang et al., 2020).

11 Conclusion

We presented victim count extraction, a challenging and impactful task. The task can be tackled using different formulations and models. Models should be evaluated along different dimensions such as accuracy, reliability, and robustness. We survey this ambiguity of victim count extraction, identify promising directions, and discuss outlooks and applications.

Acknowledgments

We would like to thank and acknowledge ideas, input, support and feedback from Leonie Muggenthaler, Ryan Cotterell as well as the anonymous reviewers. Niklas Stoehr is supported by a scholarship from the Swiss Data Science Center (SDSC).

Limitations

The models may be biased or reproduce biases inherent in their training data. Presenting unrelated, faulty, or immoral questions to a model can cause unguided and malicious behavior. For example, we caution against asking questions such as “How many people will be injured…?” and, even worse, “How many people should be injured…?”. Improving model calibration will help defend against these issues and enable awareness of when to abstain from answering.

Ethics Statement

This work originated from the motivation to diversify victim count extraction towards underrepresented victim types and overlooked forms of violence. This work ultimately intends to assist researchers and analysts in the humanitarian aid sector who are in need of accurate victim count information.

Appendix A Regex Patterns

We convert any non-digitized numeral expressions into a digitized format (e.g. twelve $\rightarrow$ 12). Regex patterns are designed for both passive and active voices. We also distinguish plural (“are” and “were”) and singular forms (“is”, “was”) for passive voice patterns. The algorithm checks the patterns in the following order: passive plural, passive singular, and active. If multiple numbers are extracted, the first is kept. We list the regex patterns used to extract death counts and injury counts in Tab. 6.
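The procedure described above can be sketched in a few lines. The patterns and the numeral table below are simplified, hypothetical stand-ins written for this illustration; the patterns actually used are those listed in Tab. 6.

```python
import re

# Tiny numeral table for illustration; a real system needs a fuller converter.
WORD_TO_DIGIT = {"one": "1", "two": "2", "three": "3", "four": "4",
                 "five": "5", "twelve": "12"}

# Hypothetical death-count patterns, checked in the order given in the text:
# passive plural, passive singular, then active.
PATTERNS = [
    re.compile(r"(\d+)\s+(?:\w+\s+)?(?:are|were)\s+killed"),
    re.compile(r"(\d+)\s+(?:\w+\s+)?(?:is|was)\s+killed"),
    re.compile(r"(?:killed|killing)\s+(\d+)"),
]


def digitize(text):
    """Replace spelled-out numerals with digits (e.g. twelve -> 12)."""
    return re.sub(r"\b[a-z]+\b",
                  lambda m: WORD_TO_DIGIT.get(m.group(0), m.group(0)),
                  text.lower())


def extract_count(text):
    """Return the first number matched by the first pattern that fires."""
    text = digitize(text)
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))  # only the first match is kept
    return None


print(extract_count("Twelve civilians were killed in the attack"))
print(extract_count("The blast killed three people"))
```

Under these assumed patterns, a sentence such as “two soldiers and five civilians were killed” yields 5 rather than 7, because a single pattern match returns one number and nothing sums the partial counts.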

Appendix B Accuracy Evaluation

In this section, we complement the accuracy evaluation of the models in § 5.

B.1 Exact-Match and $F_{1}$ score on Death Counts

The Exact-Match and $F_{1}$ scores on extracting the death counts are shown in Tab. 7, which compares the performance of the baseline models and the fine-tuned NT5-Gen model. Similar to the results on the injury counts (Tab. 1), the fine-tuned NT5-Gen model performs better than all baselines, and the SRL model has the best accuracy among the baselines.
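As a rough illustration of the two metrics, the sketch below uses the token-overlap definition of $F_{1}$ that is common in extractive question answering. This definition is an assumption made here; the exact variant used for the reported scores may differ.

```python
from collections import Counter


def exact_match(prediction, gold):
    """1 if the predicted answer string equals the gold answer, else 0."""
    return int(prediction.strip() == gold.strip())


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall (assumed variant)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("12", "12"), token_f1("12 15", "12"))
```

Under this definition, a prediction that contains the gold number alongside an extra number receives partial $F_{1}$ credit but no Exact-Match credit.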

B.2 Confusion Matrix on Death Counts

Similar to Fig. 1 shown in § 5.1, Fig. 5 plots the confusion matrices of the binned death counts for the different datasets, which compare the accuracy of the baseline models with the fine-tuned NT5-Gen model.

B.3 Results on Classification and Regression

In § 5.2, we have shown the results of the NT5-Clf model and the NT5-Reg model fine-tuned on NAVCO injury counts in Tab. 3.

Here, we use the same metrics and display the classification performance on other datasets. Specifically, Tab. 8, Tab. 9, Tab. 10, Tab. 11, and Tab. 12 respectively show the classification performance of the NT5-Clf model fine-tuned on WAD death counts, WAD injury counts, NAVCO death counts, EMM death counts, and EMM injury counts.

Similarly, we provide the scatter plots of the fine-tuned NT5-Reg models initialized with different pre-trained weights in this section: WAD death counts (Fig. 6), WAD injury counts (Fig. 7), NAVCO death counts (Fig. 8), EMM death counts (Fig. 9), and EMM injury counts (Fig. 10).

Appendix C Robustness Evaluation

In this section, we provide the detailed performance of the few-shot setting (§ C.1) and the out-of-distribution setting (§ C.2) discussed in § 7.

C.1 Few-shot Performance

We display the results of the few-shot settings where different proportions of the training set are used to fine-tune the models. For each formulation, the left figure shows the variation in the accuracy metrics and the right figure shows the variation in the calibration error. Fig. 11, Fig. 12, and Fig. 13 show the few-shot performance of the fine-tuned NT5-Clf, NT5-Reg, and NT5-Gen models, respectively.

With respect to accuracy metrics, the classification accuracy and the $F_{1}$ score are plotted for the fine-tuned NT5-Clf model.

For the regression setting, we plot the change in mean squared error on the $\log$-transformed counts. In addition, we plot the pinball losses at two target quantiles (10% and 90%).
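The two regression metrics can be sketched as follows. The $+1$ offset inside the logarithm (to handle zero counts) and the choice to compute the pinball loss on the values as passed in are assumptions for this illustration, since the exact transformation is not restated here.

```python
import math


def log_mse(y_true, y_pred):
    """Mean squared error on log(1 + count); the +1 offset is an assumption."""
    errs = [(math.log1p(t) - math.log1p(p)) ** 2
            for t, p in zip(y_true, y_pred)]
    return sum(errs) / len(errs)


def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss for a target quantile tau in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


print(pinball_loss([10.0], [8.0], 0.9), pinball_loss([10.0], [8.0], 0.1))
```

At the 90% quantile, under-predicting a count is penalized nine times more heavily than over-predicting it by the same amount; at the 10% quantile the asymmetry is reversed.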

Lastly, the Exact-Match and the $F_{1}$ scores are drawn for the fine-tuned NT5-Gen model.

C.2 Out-of-distribution Setting

For each task formulation, we examine the performance in the out-of-distribution setting for the fine-tuned NT5-Clf (Fig. 14), NT5-Reg (Fig. 15), and NT5-Gen (Fig. 16) models. For all plots, the x-axis is the accuracy metric used in each task formulation, and the y-axis indicates the test set on which inference is run. The red bar indicates in-distribution performance, e.g., accuracy on the WAD death test data using the model fine-tuned on WAD death.

With respect to the accuracy metric, different formulations use their corresponding metric. For the classification setting, we show the variation in classification accuracy. For the regression setting, we show the variation in mean squared errors. For the generation setting, we show the change in Exact-Match scores.

Bibliography

Braverman et al. (2020) Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, and Yi Zhang. 2020. Calibration, entropy rates, and memory in language models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1089–1099. PMLR.
Camilleri et al. (2019) Stephen Camilleri, Joel Azzopardi, and Matthew R. Agius. 2019. Investigating the relationship between earthquakes and online news. In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pages 203–210.
Chaudoin et al. (2017) Stephen Chaudoin, Zachary Peskowitz, and Christopher Stanton. 2017. Beyond zeroes and ones: The intensity and dynamics of civil conflict. The Journal of Conflict Resolution, 61(1):56–83. Publisher: Sage Publications, Inc.
Chen et al. (2020) Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, and Wei Chu. 2020. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6759–6768, Online. Association for Computational Linguistics.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Bar
Cruyff et al. (2017) Maarten Cruyff, Jan van Dijk, and Peter G. M. van der Heijden. 2017. The challenge of counting victims of human trafficking: Not on the record: A multiple systems estimation of the numbers of human trafficking victims in the Netherlands in 2010–2015 by year, age, gender, and type of exploitation. CHANCE, 30(3):41–49.
Cullen et al. (2021) Patricia Cullen, Myrna Dawson, Jenna Price, and James Rowlands. 2021. Intersectionality and invisible victims: Reflections on data challenges and vicarious trauma in femicide, family and intimate partner homicide research. Journal of Family Violence, 36(5):619–628.
Darcy and Hofmann (2003) James Darcy and Charles-Antoine Hofmann. 2003. According to need? Needs assessment and decision-making in the humanitarian sector. Technical report, Overseas Development Institute.
