Learning to Select, Track, and Generate for Data-to-Text

Hayate Iso; Yui Uehara; Tatsuya Ishigaki; Hiroshi Noji; Eiji Aramaki; Ichiro Kobayashi; Yusuke Miyao; Naoaki Okazaki; Hiroya Takamura

arXiv:1907.09699·cs.CL·April 5, 2021

TL;DR

This paper introduces a novel data-to-text generation model with tracking and generation modules that emulate human writing, improving summary quality by effectively selecting and organizing information.

Contribution

The paper presents a new model with separate tracking and generation modules, demonstrating improved performance over existing methods and exploring the role of writer information.

Findings

1. Outperforms existing models on all evaluation metrics.
2. Incorporating writer information enhances generation quality.
3. Effective information tracking improves content planning.

Abstract

We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. Our tracking module selects and keeps track of salient information and memorizes which record has been mentioned. Our generation module generates a summary conditioned on the state of tracking module. Our model is considered to simulate the human-like writing process that gradually selects the information by determining the intermediate variables while writing the summary. In addition, we also explore the effectiveness of the writer information for generation. Experimental results show that our model outperforms existing models in all evaluation metrics even without writer information. Incorporating writer information further improves the performance, contributing to content planning and surface realization.

Tables

Table 1. (a) Box score: Top contingency table shows number of wins and losses and summary of each game. Bottom table shows statistics of each player such as points scored ( Player ’s Pts ), and total rebounds ( Player ’s Reb ).

Team	H/V	Win	Loss	Pts	Reb	Ast	Fg_Pct	Fg3_Pct	$\dots$
Knicks	H	16	19	104	46	26	45	46	$\dots$
Bucks	V	18	16	105	42	20	47	32	$\dots$

Table 2. Running example of our model’s generation process. At every time step $t$, the model predicts each random variable. The model first determines whether to refer to the data records ($Z_{t}=1$) or not ($Z_{t}=0$). If $Z_{t}=1$, the model selects an entity $E_{t}$, its attribute $A_{t}$, and the binary variable $N_{t}$ if needed. For example, at $t=202$, the model predicts $Z_{202}=1$ and then selects the entity Jabari Parker and its attribute Player Pts. Given these values, the model outputs the token $\mathbf{15}$ from the selected data record.

$t$	199	200	201	202	203	204	205	206	207	208	209
$Y_{t}$	Jabari	Parker	contributed	15	points	,	four	rebounds	,	three	assists
$Z_{t}$	1	1	0	1	0	0	1	0	0	1	0
$E_{t}$	Jabari Parker	Jabari Parker	-	Jabari Parker	-	-	Jabari Parker	-	-	Jabari Parker	-
$A_{t}$	First Name	Last Name	-	Player Pts	-	-	Player Reb	-	-	Player Ast	-
$N_{t}$	-	-	-	0	-	-	1	-	-	1	-

Table 3. Experimental results. The metrics evaluate whether important information is selected (CS), described accurately (RG), and presented in the correct order (CO).

Method	RG #	RG P%	CS P%	CS R%	CS F1%	CO DLD%	BLEU
Gold	27.36	93.42	100.	100.	100.	100.	100.
Templates	54.63	100.	31.01	58.85	40.61	17.50	8.43
Wiseman et al. (2017)	22.93	60.14	24.24	31.20	27.29	14.70	14.73
Puduppully et al. (2019)	33.06	83.17	33.06	43.59	37.60	16.97	13.96
Proposed	39.05	94.43	35.77	52.05	42.40	19.38	16.15

Table 4. Effects of writer information. $\boldsymbol{w}$ indicates that writer embeddings are used. Numbers in bold are the largest among the variants of each method.

Method	RG #	RG P%	CS P%	CS R%	CS F1%	CO DLD%	BLEU
Puduppully et al. (2019)	33.06	83.17	33.06	43.59	37.60	16.97	13.96
+ $\boldsymbol{w}$ in stage 1	28.43	84.75	45.00	49.73	47.25	22.16	18.18
+ $\boldsymbol{w}$ in stage 2	35.06	80.51	31.10	45.28	36.87	16.38	17.81
+ $\boldsymbol{w}$ in stage 1 & 2	28.00	82.27	44.37	48.71	46.44	22.41	18.90
Proposed	39.05	94.38	35.77	52.05	42.40	19.38	16.15
+ $\boldsymbol{w}$	30.25	92.00	50.75	59.03	54.58	25.75	20.84

Equations

$\boldsymbol{r}_{e,a,v}=\tanh\left(\boldsymbol{W}^{\textsc{R}}(\boldsymbol{e}\oplus\boldsymbol{a}\oplus\boldsymbol{v})\right)$ (1)

$\bar{\boldsymbol{e}}=\tanh\left(\sum_{a\in\mathcal{A}}\boldsymbol{W}^{\textsc{A}}_{a}\boldsymbol{r}_{e,a,\boldsymbol{x}[e,a]}\right)$ (2)

$p(Z_{t}=1\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})=\sigma\left(\boldsymbol{W}_{z}(\boldsymbol{h}_{t-1}^{\textsc{LM}}\oplus\boldsymbol{h}_{t-1}^{\textsc{Ent}})\right)$ (3)

$p(E_{t}=e\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})\propto\dots$ (4)

$\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}}=\begin{cases}\boldsymbol{h}_{t-1}^{\textsc{Ent}}&\text{if }e_{t}=e_{t-1}\\\textsc{Gru}^{\textsc{E}}(\bar{\boldsymbol{e}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})&\text{else if }e_{t}\not\in\mathcal{E}_{t-1}\\\textsc{Gru}^{\textsc{E}}(\boldsymbol{W}^{\textsc{S}}_{s}\boldsymbol{h}_{s}^{\textsc{Ent}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})&\text{otherwise}\end{cases}$ (5)

$p(A_{t}=a\mid e_{t},\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}})\propto\dots$ (6)

$\boldsymbol{h}_{t}^{\textsc{Ent}}=\textsc{Gru}^{\textsc{A}}(\boldsymbol{r}_{e_{t},a_{t},\boldsymbol{x}[e_{t},a_{t}]},\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}})$ (7)

$p(N_{t}=1\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t}^{\textsc{Ent}})=\sigma\left(\boldsymbol{W}^{\textsc{N}}(\boldsymbol{h}_{t-1}^{\textsc{LM}}\oplus\boldsymbol{h}_{t}^{\textsc{Ent}})\right)$ (8)

$\boldsymbol{h}_{t}^{\prime}=\dots$ (9)

$p(Y_{t}\mid\boldsymbol{h}_{t}^{\prime})=\mathrm{softmax}(\boldsymbol{W}^{\textsc{Y}}\boldsymbol{h}_{t}^{\prime})$ (10)

$\boldsymbol{h}_{t}^{\textsc{LM}}=\dots$ (11)

$\boldsymbol{h}_{t}^{\prime}=\dots$ (12)

$\log p(Y_{1:T},Z_{1:T},E_{1:T},A_{1:T},N_{1:T}\mid\boldsymbol{x})=\dots$ (13)


Full text

Learning to Select, Track, and Generate for Data-to-Text

Hayate Iso†, Yui Uehara‡, Tatsuya Ishigaki♮‡, Hiroshi Noji‡, Eiji Aramaki†‡, Ichiro Kobayashi♭‡, Yusuke Miyao♯‡, Naoaki Okazaki♮‡, Hiroya Takamura♮‡

†Nara Institute of Science and Technology, ‡Artificial Intelligence Research Center, AIST, ♮Tokyo Institute of Technology, ♭Ochanomizu University, ♯The University of Tokyo

{iso.hayate.id3,aramaki}@is.naist.jp [email protected]

{yui.uehara,ishigaki.t,hiroshi.noji,takamura.hiroya}@aist.go.jp

[email protected] [email protected]

Work was done during the internship at Artificial Intelligence Research Center, AIST.

1 Introduction

Advances in sensor and data storage technologies have rapidly increased the amount of data produced in various fields such as weather, finance, and sports. In order to address the information overload caused by the massive data, data-to-text generation technology, which expresses the contents of data in natural language, becomes more important Barzilay and Lapata (2005). Recently, neural methods can generate high-quality short summaries especially from small pieces of data Liu et al. (2018).

Despite this success, it remains challenging to generate a high-quality long summary from data Wiseman et al. (2017). One reason for the difficulty is that the input data is too large for a naive model to find its salient part, i.e., to determine which part of the data should be mentioned. In addition, the salient part moves as the summary explains the data. For example, when generating a summary of a basketball game (Table 1 (b)) from the box score (Table 1 (a)), the input contains numerous data records about the game: e.g., Jordan Clarkson scored 18 points. Existing models often refer to the same data record multiple times Puduppully et al. (2019). The models may also mention an incorrect data record, e.g., Kawhi Leonard added 19 points, when the summary should instead mention LaMarcus Aldridge, who scored 19 points. Thus, we need a model that finds salient parts, tracks transitions of salient parts, and expresses information faithful to the input.

In this paper, we propose a novel data-to-text generation model with two modules, one for saliency tracking and another for text generation. The tracking module keeps track of saliency in the input data: when the module detects a saliency transition, the tracking module selects a new data record (we use ‘data record’ and ‘relation’ interchangeably) and updates the state of the tracking module. The text generation module generates a document conditioned on the current tracking state. Our model is considered to imitate the human-like writing process that gradually selects and tracks the data while generating the summary. In addition, we note some writer-specific patterns and characteristics: how data records are selected to be mentioned; and how data records are expressed as text, e.g., the order of data records and the word usages. We also incorporate writer information into our model.

The experimental results demonstrate that, even without writer information, our model outperforms previous models on all evaluation metrics: 94.38% precision of relation generation, 42.40% F1 score of content selection, 19.38% normalized Damerau-Levenshtein Distance (DLD) of content ordering, and a BLEU score of 16.15%. We also confirm that adding writer information further improves the performance.

2 Related Work

2.1 Data-to-Text Generation

Data-to-text generation is a task for generating descriptions from structured or non-structured data including sports commentary Tanaka-Ishii et al. (1998); Chen and Mooney (2008); Taniguchi et al. (2019), weather forecast Liang et al. (2009); Mei et al. (2016), biographical text from infobox in Wikipedia Lebret et al. (2016); Sha et al. (2018); Liu et al. (2018) and market comments from stock prices Murakami et al. (2017); Aoki et al. (2018).

Neural generation methods have become the mainstream approach for data-to-text generation. The encoder-decoder framework Cho et al. (2014); Sutskever et al. (2014) with the attention Bahdanau et al. (2015); Luong et al. (2015) and copy mechanism Gu et al. (2016); Gulcehre et al. (2016) has been successfully applied to data-to-text tasks. However, neural generation methods sometimes yield fluent but inadequate descriptions Tu et al. (2017). In data-to-text generation, descriptions inconsistent with the input data are problematic.

Recently, Wiseman et al. (2017) introduced the RotoWire dataset, which contains multi-sentence summaries of basketball games with box-score (Table 1). This dataset requires the selection of a salient subset of data records for generating descriptions. They also proposed automatic evaluation metrics for measuring the informativeness of generated summaries.

Puduppully et al. (2019) proposed a two-stage method that first predicts the sequence of data records to be mentioned and then generates a summary conditioned on the predicted sequences. Their idea is similar to ours in that both consider a sequence of data records as content planning. However, our proposal differs from theirs in that ours uses a recurrent neural network for saliency tracking, and in that our decoder dynamically chooses a data record to be mentioned without fixing a sequence of data records.

2.2 Memory modules

The memory network can be used to maintain and update representations of the salient information Weston et al. (2015); Sukhbaatar et al. (2015); Graves et al. (2016). This module is often used in natural language understanding to keep track of the entity state Kobayashi et al. (2016); Hoang et al. (2018); Bosselut et al. (2018).

Recently, entity tracking has been popular for generating coherent text Kiddon et al. (2016); Ji et al. (2017); Yang et al. (2017); Clark et al. (2018). Kiddon et al. (2016) proposed a neural checklist model that updates predefined item states. Ji et al. (2017) proposed an entity representation for the language model: their method updates the entity tracking states when an entity is introduced and selects the salient entity state.

Our model extends this entity tracking module for data-to-text generation tasks. The entity tracking module selects the salient entity and appropriate attribute in each timestep, updates their states, and generates coherent summaries from the selected data record.

3 Data

Through careful examination, we found that in the original dataset RotoWire, some NBA games have two documents, one of which is sometimes in the training data and the other is in the test or validation data. Such documents are similar to each other, though not identical. To make this dataset more reliable as an experimental dataset, we created a new version.

We ran the script provided by Wiseman et al. (2017), which is for crawling the RotoWire website for NBA game summaries. The script collected approximately 78% of the documents in the original dataset; the remaining documents disappeared. We also collected the box-scores associated with the collected documents. We observed that some of the box-scores were modified compared with the original RotoWire dataset.

The collected dataset contains 3,752 instances (i.e., pairs of a document and box-scores). However, the four shortest documents were not summaries; they were, for example, an announcement about the postponement of a match. We thus deleted these 4 instances and were left with 3,748 instances. We followed the dataset split by Wiseman et al. (2017) to split our dataset into training, development, and test data. We found 14 instances that did not have corresponding instances in the original data. We randomly assigned 9, 2, and 3 of those 14 instances respectively to the training, development, and test data. Finally, the sizes of our training, development, and test datasets are respectively 2,714, 534, and 500. On average, each summary has 384 tokens and 644 data records. Each match has only one summary in our dataset, as far as we checked. We also collected the writer of each document. Our dataset contains 32 different writers. The most prolific writer in our dataset wrote 607 documents. There are also writers who wrote fewer than ten documents. On average, each writer wrote 117 documents. We call our new dataset RotoWire-Modified (for information about the dataset, see https://github.com/aistairc/rotowire-modified).
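The counts reported in this section can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check of the stated numbers; the variable names are ours.

```python
# Counts as stated in the text of this section.
collected = 3752        # instances gathered by the crawl
non_summaries = 4       # shortest documents that were not game summaries
splits = {"train": 2714, "dev": 534, "test": 500}
num_writers = 32

remaining = collected - non_summaries
docs_per_writer = remaining / num_writers

print(remaining, sum(splits.values()), docs_per_writer)
```

The split sizes add up to the number of remaining instances, and the per-writer average rounds to the reported 117 documents.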

4 Saliency-Aware Text Generation

At the core of our model is a neural language model with a memory state $\boldsymbol{h}^{\textsc{LM}}$ to generate a summary $y_{1:T}=(y_{1},\dots,y_{T})$ given a set of data records $\boldsymbol{x}$ . Our model has another memory state $\boldsymbol{h}^{\textsc{Ent}}$ , which is used to remember the data records that have been referred to. $\boldsymbol{h}^{\textsc{Ent}}$ is also used to update $\boldsymbol{h}^{\textsc{LM}}$ , meaning that the referred data records affect the text generation.

Our model decides whether to refer to $\boldsymbol{x}$ , which data record $r\in\boldsymbol{x}$ to be mentioned, and how to express a number. The selected data record is used to update $\boldsymbol{h}^{\textsc{Ent}}$ . Formally, we use the four variables:

1. $Z_{t}$ : binary variable that determines whether the model refers to input $\boldsymbol{x}$ at time step $t$ ( $Z_{t}=1$ ).

2. $E_{t}$ : At each time step $t$ , this variable indicates the salient entity (e.g., Hawks, LeBron James).

3. $A_{t}$ : At each time step $t$ , this variable indicates the salient attribute to be mentioned (e.g., Pts).

4. $N_{t}$ : If attribute $A_{t}$ of the salient entity $E_{t}$ is a numeric attribute, this variable determines if a value in the data records should be output in Arabic numerals (e.g., 50) or in English words (e.g., five).

To keep track of the salient entity, our model predicts these random variables at each time step $t$ through its summary generation process. A running example of our model is shown in Table 2, and the full algorithm is described in Appendix A. In the following subsections, we explain how to initialize the model, predict these random variables, and generate a summary. Due to space limitations, bias vectors are omitted.

Before explaining our method, we describe our notation. Let $\mathcal{E}$ and $\mathcal{A}$ denote the sets of entities and attributes, respectively. Each record $r\in\boldsymbol{x}$ consists of entity $e\in\mathcal{E}$ , attribute $a\in\mathcal{A}$ , and its value $\boldsymbol{x}[e,a]$ , and is therefore represented as $r=(e,a,\boldsymbol{x}[e,a])$ . For example, the box-score in Table 1 has a record $r$ such that $e=\textsc{Anthony Davis},a=\textsc{Pts},$ and $\boldsymbol{x}[e,a]=20$ .
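To make the notation concrete, the sketch below stores a box score as a mapping from (entity, attribute) pairs to values and enumerates the records $r=(e,a,\boldsymbol{x}[e,a])$. The `Record` type and the helper name are illustrative assumptions, not part of the released code; the values are the team rows of Table 1.

```python
from typing import Dict, List, NamedTuple, Tuple, Union

Value = Union[int, float, str]

class Record(NamedTuple):
    """One data record r = (e, a, x[e, a]). Hypothetical helper type."""
    entity: str
    attribute: str
    value: Value

# x[e, a] -> value, using a few team-level cells from Table 1.
x: Dict[Tuple[str, str], Value] = {
    ("Knicks", "Pts"): 104,
    ("Knicks", "Reb"): 46,
    ("Bucks", "Pts"): 105,
    ("Bucks", "Reb"): 42,
}

def records(box: Dict[Tuple[str, str], Value]) -> List[Record]:
    """Enumerate every record contained in the box score."""
    return [Record(e, a, v) for (e, a), v in box.items()]

entities = sorted({e for e, _ in x})    # the set of entities
attributes = sorted({a for _, a in x})  # the set of attributes

print(records(x)[0], entities, attributes)
```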

4.1 Initialization

Let $\boldsymbol{r}$ denote the embedding of data record $r\in\boldsymbol{x}$ . Let $\bar{\boldsymbol{e}}$ denote the embedding of entity $e$ . Note that $\bar{\boldsymbol{e}}$ depends on the set of data records, i.e., it depends on the game. We also use $\boldsymbol{e}$ for static embedding of entity $e$ , which, on the other hand, does not depend on the game.

Given the embedding of entity $\boldsymbol{e}$ , attribute $\boldsymbol{a}$ , and its value $\boldsymbol{v}$ , we use the concatenation layer to combine the information from these vectors to produce the embedding of each data record $(e,a,v)$ , denoted as $\boldsymbol{r}_{e,a,v}$ as follows:

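$\boldsymbol{r}_{e,a,v}=\tanh\left(\boldsymbol{W}^{\textsc{R}}(\boldsymbol{e}\oplus\boldsymbol{a}\oplus\boldsymbol{v})\right)$ (1)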

where $\oplus$ indicates the concatenation of vectors, and $\boldsymbol{W}^{\textsc{R}}$ denotes a weight matrix. We also concatenate an embedding vector that represents whether the entity is on the home or away team.

We obtain $\bar{\boldsymbol{e}}$ in the set of data records $\boldsymbol{x}$ , by summing all the data-record embeddings transformed by a matrix:

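$\bar{\boldsymbol{e}}=\tanh\left(\sum_{a\in\mathcal{A}}\boldsymbol{W}^{\textsc{A}}_{a}\boldsymbol{r}_{e,a,\boldsymbol{x}[e,a]}\right)$ (2)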

where $\boldsymbol{W}^{\textsc{A}}_{a}$ is a weight matrix for attribute $a$ . Since $\bar{\boldsymbol{e}}$ depends on the game as above, $\bar{\boldsymbol{e}}$ is supposed to represent how entity $e$ played in the game.
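The two embedding steps of this subsection (a tanh of a linear map over the concatenated entity, attribute, and value embeddings, followed by a tanh of a sum of attribute-specific linear maps) can be sketched in plain Python. This is a minimal illustration with random stand-in weights and a hypothetical dimension `d`; it omits the home/away embedding and is not the authors' implementation.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rand_vector(dim, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def record_embedding(W_R, e, a, v):
    """Record embedding: tanh(W^R applied to the concatenation of e, a, v)."""
    return [math.tanh(z) for z in matvec(W_R, e + a + v)]

def entity_embedding(W_A, record_embs):
    """Game-dependent entity embedding: tanh of the sum over attributes of W^A_a r."""
    dim = len(next(iter(W_A.values())))
    total = [0.0] * dim
    for attr, r in record_embs.items():
        total = [t + z for t, z in zip(total, matvec(W_A[attr], r))]
    return [math.tanh(t) for t in total]

rng = random.Random(0)
d = 4                      # hypothetical embedding size
attrs = ["Pts", "Reb"]     # a tiny attribute set for illustration
W_R = rand_matrix(d, 3 * d, rng)
W_A = {a: rand_matrix(d, d, rng) for a in attrs}

e_static = rand_vector(d, rng)  # static entity embedding (game independent)
record_embs = {
    a: record_embedding(W_R, e_static, rand_vector(d, rng), rand_vector(d, rng))
    for a in attrs
}
e_bar = entity_embedding(W_A, record_embs)
print(e_bar)
```

Because the record embeddings depend on the values in the box score, `e_bar` changes from game to game even though `e_static` does not.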

To initialize the hidden state of each module, we use the embedding of <SoD> for $\boldsymbol{h}^{\textsc{LM}}$ and the average of the entity embeddings $\bar{\boldsymbol{e}}$ for $\boldsymbol{h}^{\textsc{Ent}}$ .

4.2 Saliency transition

Generally, the saliency of text changes during text generation. In our work, we suppose that the saliency is represented as the entity and its attribute being talked about. We therefore propose a model that refers to a data record at each time step and transitions to another as the text progresses.

To determine whether to transition to another data record or not at time $t$ , the model calculates the following probability:

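$p(Z_{t}=1\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})=\sigma\left(\boldsymbol{W}_{z}(\boldsymbol{h}_{t-1}^{\textsc{LM}}\oplus\boldsymbol{h}_{t-1}^{\textsc{Ent}})\right)$ (3)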

where $\sigma(\cdot)$ is the sigmoid function. If $p(Z_{t}=1\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})$ is high, the model transitions to another data record.

When the model decides to transition to another data record, it then determines which entity and attribute to refer to, and generates the next word (Section 4.3). On the other hand, if the model decides not to transition, it generates the next word without updating the tracking state, i.e., $\boldsymbol{h}^{\textsc{Ent}}_{t}=\boldsymbol{h}^{\textsc{Ent}}_{t-1}$ (Section 4.4).
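The transition decision is a sigmoid gate over the concatenated language-model and tracking states. The sketch below illustrates it with hand-picked numbers; treating $\boldsymbol{W}_{z}$ as a single weight row and thresholding at 0.5 are assumptions made for illustration, since the text only says the model transitions when the probability is high.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def transition_probability(w_z, h_lm_prev, h_ent_prev):
    """p(Z_t = 1): sigmoid of a linear score over the concatenated states."""
    concat = h_lm_prev + h_ent_prev  # list concatenation plays the role of the concat operator
    return sigmoid(sum(w * h for w, h in zip(w_z, concat)))

def should_transition(p, threshold=0.5):
    """Hypothetical decision rule; the threshold is an assumption."""
    return p > threshold

h_lm_prev = [0.2, -0.1]          # toy language-model state
h_ent_prev = [0.5, 0.3]          # toy tracking state
w_z = [1.0, 0.0, 2.0, -1.0]      # toy weights, one per concatenated dimension

p = transition_probability(w_z, h_lm_prev, h_ent_prev)
print(p, should_transition(p))
```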

4.3 Selection and tracking

When the model refers to a new data record ( $Z_{t}=1$ ), it selects an entity and its attribute. It also tracks the saliency by putting the information about the selected entity and attribute into the memory vector $\boldsymbol{h}^{\textsc{Ent}}$ . The model first selects the subject entity and, if the subject entity changes, updates the memory state.

Specifically, the model first calculates the probability of selecting an entity:

$p(E_{t}=e\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})\propto\dots$ (4)

where $\mathcal{E}_{t-1}$ is the set of entities that have already been referred to by time step $t$ , and $s$ is defined as $s={\max\{s:s\leq t-1,e=e_{s}\}}$ , which indicates the time step when this entity was last mentioned.

The model selects the most probable entity as the next salient entity and updates the set of entities that appeared ( $\mathcal{E}_{t}=\mathcal{E}_{t-1}\cup\{e_{t}\}$ ).

If the salient entity changes $(e_{t}\not=e_{t-1})$ , the model updates the hidden state of the tracking model $\boldsymbol{h}^{\textsc{Ent}}$ with a recurrent neural network with a gated recurrent unit (Gru; Chung et al., 2014):

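$\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}}=\begin{cases}\boldsymbol{h}_{t-1}^{\textsc{Ent}}&\text{if }e_{t}=e_{t-1}\\\textsc{Gru}^{\textsc{E}}(\bar{\boldsymbol{e}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})&\text{else if }e_{t}\not\in\mathcal{E}_{t-1}\\\textsc{Gru}^{\textsc{E}}(\boldsymbol{W}^{\textsc{S}}_{s}\boldsymbol{h}_{s}^{\textsc{Ent}},\boldsymbol{h}_{t-1}^{\textsc{Ent}})&\text{otherwise}\end{cases}$ (5)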

Note that if the selected entity at time step $t$ , $e_{t}$ , is identical to the previously selected entity $e_{t-1}$ , the hidden state of the tracking model is not updated.

If the selected entity $e_{t}$ is new ( $e_{t}\not\in\mathcal{E}_{t-1}$ ), the hidden state of the tracking model is updated with the embedding $\bar{\boldsymbol{e}}$ of entity $e_{t}$ as input. In contrast, if entity $e_{t}$ has already appeared in the past ( $e_{t}\in\mathcal{E}_{t-1}$ ) but is not identical to the previous one $(e_{t}\not=e_{t-1})$ , we use $\boldsymbol{h}_{s}^{\textsc{Ent}}$ (i.e., the memory state when this entity last appeared) to fully exploit the local history of this entity.
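The three cases described above (same entity, new entity, previously seen entity) amount to a small piece of branching logic. The sketch below shows only that control flow: `toy_gru` is a deliberately simplistic stand-in and not a real gated recurrent unit, `W_S` is treated as a single matrix, and all names are ours.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def toy_gru(x, h):
    """Stand-in for the entity GRU: NOT a real GRU, just an elementwise average."""
    return [(xi + hi) / 2.0 for xi, hi in zip(x, h)]

def update_entity_state(e_t, e_prev, seen, h_prev, e_bar, last_state, W_S, gru=toy_gru):
    """Return the intermediate tracking state after selecting entity e_t."""
    if e_t == e_prev:
        # Same salient entity as before: the tracking state is left unchanged.
        return h_prev
    if e_t not in seen:
        # New entity: feed its game-dependent embedding.
        return gru(e_bar[e_t], h_prev)
    # Entity seen earlier: feed the (transformed) state from its last mention.
    return gru(matvec(W_S, last_state[e_t]), h_prev)

h_prev = [0.0, 0.0]
e_bar = {"Jabari Parker": [1.0, 1.0], "Knicks": [0.5, -0.5]}
last_state = {"Knicks": [0.2, 0.4]}   # tracking state when "Knicks" was last mentioned
seen = {"Knicks"}                     # entities referred to so far
W_S = [[1.0, 0.0], [0.0, 1.0]]        # identity, for readability

unchanged = update_entity_state("Knicks", "Knicks", seen, h_prev, e_bar, last_state, W_S)
new_entity = update_entity_state("Jabari Parker", "Knicks", seen, h_prev, e_bar, last_state, W_S)
revisited = update_entity_state("Knicks", "Jabari Parker", seen, h_prev, e_bar, last_state, W_S)
print(unchanged, new_entity, revisited)
```

In a real model the `gru` argument would be a trained recurrent cell; passing it as a parameter keeps the branching logic independent of that choice.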

Given the updated hidden state of the tracking model $\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}}$ , we next select the attribute of the salient entity with the following probability:

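$p(A_{t}=a\mid e_{t},\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}})\propto\dots$ (6)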

After selecting $a_{t}$ , i.e., the most probable attribute of the salient entity, the tracking model updates the memory state $\boldsymbol{h}_{t}^{\textsc{Ent}}$ with the embedding of the data record $\boldsymbol{r}_{e_{t},a_{t},\boldsymbol{x}[e_{t},a_{t}]}$ introduced in Section 4.1:

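$\boldsymbol{h}_{t}^{\textsc{Ent}}=\textsc{Gru}^{\textsc{A}}(\boldsymbol{r}_{e_{t},a_{t},\boldsymbol{x}[e_{t},a_{t}]},\boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}})$ (7)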

4.4 Summary generation

Given two hidden states, one for language model $\boldsymbol{h}_{t-1}^{\textsc{LM}}$ and the other for tracking model $\boldsymbol{h}_{t}^{\textsc{Ent}}$ , the model generates the next word $y_{t}$ . We also incorporate a copy mechanism that copies the value of the salient data record $\boldsymbol{x}[e_{t},a_{t}]$ .

If the model refers to a new data record ( $Z_{t}=1$ ), it directly copies the value of the data record $\boldsymbol{x}[e_{t},a_{t}]$ . However, the values of numerical attributes can be expressed in at least two different manners: Arabic numerals (e.g., 14) and English words (e.g., fourteen). We decide which one to use by the following probability:

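$p(N_{t}=1\mid\boldsymbol{h}_{t-1}^{\textsc{LM}},\boldsymbol{h}_{t}^{\textsc{Ent}})=\sigma\left(\boldsymbol{W}^{\textsc{N}}(\boldsymbol{h}_{t-1}^{\textsc{LM}}\oplus\boldsymbol{h}_{t}^{\textsc{Ent}})\right)$ (8)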

where $\boldsymbol{W}^{\textsc{N}}$ is a weight matrix. The model then updates the hidden states of the language model:

$\boldsymbol{h}_{t}^{\prime}=\dots$ (9)

where $\boldsymbol{W}^{\textsc{H}}$ is a weight matrix.
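The effect of $N_{t}$ on the copied value can be illustrated with a small helper. Following the running example in Table 2, where 15 is emitted with $N_{t}=0$ and "four" with $N_{t}=1$, the sketch assumes that $N_{t}=1$ selects English words; the function names and the 0 to 99 range are our own choices.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer in [0, 99] in English."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 are supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def realize_value(value: int, n_t: int) -> str:
    """Copy a numeric value as words (n_t == 1) or as Arabic numerals (n_t == 0)."""
    return number_to_words(value) if n_t == 1 else str(value)

print(realize_value(15, 0), realize_value(4, 1), realize_value(3, 1))
```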

If the salient data record is the same as the previous one ( $Z_{t}=0$ ), it predicts the next word $y_{t}$ via a probability over words conditioned on the context vector $\boldsymbol{h}_{t}^{\prime}$ :

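$p(Y_{t}\mid\boldsymbol{h}_{t}^{\prime})=\mathrm{softmax}(\boldsymbol{W}^{\textsc{Y}}\boldsymbol{h}_{t}^{\prime})$ (10)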
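The word distribution is a softmax over vocabulary scores computed from the context vector. A minimal, numerically stable sketch follows, with a toy three-word vocabulary and hand-picked weights that are purely illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def word_distribution(W_Y, h_ctx, vocab):
    """Map a context vector to a probability for each vocabulary word."""
    scores = [sum(w * h for w, h in zip(row, h_ctx)) for row in W_Y]
    return dict(zip(vocab, softmax(scores)))

vocab = ["points", "rebounds", "assists"]            # toy vocabulary
W_Y = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]         # one weight row per word
h_ctx = [0.5, -0.2]                                  # toy context vector

probs = word_distribution(W_Y, h_ctx, vocab)
print(max(probs, key=probs.get), probs)
```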

Subsequently, the hidden state of language model $\boldsymbol{h}^{\textsc{LM}}$ is updated:

$\boldsymbol{h}_{t}^{\textsc{LM}}=\dots$ (11)

where $\boldsymbol{y}_{t}$ is the embedding of the word generated at time step $t$ . In our initial experiments, we observed a word repetition problem when the tracking model was not updated while generating each sentence. To avoid this problem, we also update the tracking model with a special trainable vector $\boldsymbol{v}_{\textsc{Refresh}}$ to refresh its state after our model generates a period: $\boldsymbol{h}_{t}^{\textsc{Ent}}=\textsc{Gru}^{\textsc{A}}(\boldsymbol{v}_{\textsc{Refresh}},\boldsymbol{h}_{t}^{\textsc{Ent}})$ .

4.5 Incorporating writer information

We also incorporate the information about the writer of the summaries into our model. Specifically, instead of using Equation (9), we concatenate the embedding $\boldsymbol{w}$ of a writer to $\boldsymbol{h}_{t-1}^{\textsc{LM}}\oplus\boldsymbol{h}_{t}^{\textsc{Ent}}$ to construct context vector $\boldsymbol{h}_{t}^{\prime}$ :

$\boldsymbol{h}_{t}^{\prime}=\dots$ (12)

where $\boldsymbol{W}^{\prime\textsc{H}}$ is a new weight matrix. Since this new context vector $\boldsymbol{h}_{t}^{\prime}$ is used for calculating the probability over words in Equation (10), the writer information will directly affect word generation, which is regarded as surface realization in terms of traditional text generation. Simultaneously, context vector $\boldsymbol{h}_{t}^{\prime}$ enhanced with the writer information is used to obtain $\boldsymbol{h}_{t}^{\textsc{LM}}$ , which is the hidden state of the language model and is further used to select the salient entity and attribute, as mentioned in Sections 4.2 and 4.3. Therefore, in our model, the writer information affects both surface realization and content planning.
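The following sketch illustrates how a writer embedding could enter the context vector by concatenation. The tanh nonlinearity and all numeric values are assumptions made for illustration; the text only states that $\boldsymbol{w}$ is concatenated and that a new weight matrix is used.

```python
import math

def context_vector(W_H_prime, h_lm_prev, h_ent, w):
    """Context vector from the concatenation of LM state, tracking state, and writer embedding.

    The tanh activation is an assumption of this sketch.
    """
    concat = h_lm_prev + h_ent + w
    return [math.tanh(sum(a * b for a, b in zip(row, concat))) for row in W_H_prime]

W_H_prime = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]  # toy weights: one output dimension
h_lm_prev = [1.0, 0.0]
h_ent = [0.0, 1.0]
writer_a = [1.0, 0.0]   # hypothetical writer embeddings
writer_b = [0.0, 1.0]

ctx_a = context_vector(W_H_prime, h_lm_prev, h_ent, writer_a)
ctx_b = context_vector(W_H_prime, h_lm_prev, h_ent, writer_b)
print(ctx_a, ctx_b)
```

With identical model states, the two writers yield different context vectors, which is how writer information can influence both word choice and, through the language-model state, later selection decisions.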

4.6 Learning objective

We apply fully supervised training that maximizes the following log-likelihood:

$\log p(Y_{1:T},Z_{1:T},E_{1:T},A_{1:T},N_{1:T}\mid\boldsymbol{x})=\dots$ (13)
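As a reading aid, one way the log-likelihood could decompose, given the conditional distributions defined in Sections 4.2 to 4.4, is sketched below. Which time steps each sum ranges over is our assumption based on the model description (selection only when $Z_{t}=1$, $N_{t}$ only for numeric attributes, softmax word prediction when $Z_{t}=0$); it is not taken from the original equation.

```latex
\begin{align*}
\log p(Y_{1:T}, Z_{1:T}, E_{1:T}, A_{1:T}, N_{1:T} \mid \boldsymbol{x})
 &= \sum_{t=1}^{T} \log p(Z_{t}=z_{t} \mid \boldsymbol{h}_{t-1}^{\textsc{LM}}, \boldsymbol{h}_{t-1}^{\textsc{Ent}}) \\
 &+ \sum_{t:\, z_{t}=1} \log p(E_{t}=e_{t} \mid \boldsymbol{h}_{t-1}^{\textsc{LM}}, \boldsymbol{h}_{t-1}^{\textsc{Ent}}) \\
 &+ \sum_{t:\, z_{t}=1} \log p(A_{t}=a_{t} \mid e_{t}, \boldsymbol{h}_{t-1}^{\textsc{LM}}, \boldsymbol{h}_{t}^{\textsc{Ent}^{\prime}}) \\
 &+ \sum_{t:\, z_{t}=1,\ a_{t}\ \text{numeric}} \log p(N_{t}=n_{t} \mid \boldsymbol{h}_{t-1}^{\textsc{LM}}, \boldsymbol{h}_{t}^{\textsc{Ent}}) \\
 &+ \sum_{t:\, z_{t}=0} \log p(Y_{t}=y_{t} \mid \boldsymbol{h}_{t}^{\prime})
\end{align*}
```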

5 Experiments

5.1 Experimental settings

We used RotoWire-Modified as the dataset for our experiments, which we explained in Section 3. The training, development, and test data respectively contained 2,714, 534, and 500 games.

Since we take a supervised training approach, we need the annotations of the random variables (i.e., $Z_{t}$ , $E_{t}$ , $A_{t}$ , and $N_{t}$ ) in the training data, as shown in Table 2. Instead of simple lexical matching with $r\in\boldsymbol{x}$ , which is prone to errors in the annotation, we use the information extraction system provided by Wiseman et al. (2017). Although this system is trained on noisy rule-based annotations, we conjecture that it is more robust to errors because it is trained to minimize the marginalized loss function for ambiguous relations. All training details are described in Appendix B.

5.2 Models to be compared

We compare our model (our code is available from https://github.com/aistairc/sports-reporter) against two baseline models. One is the model used by Wiseman et al. (2017), which generates a summary with an attention-based encoder-decoder model. The other baseline model is the one proposed by Puduppully et al. (2019), which first predicts the sequence of data records and then generates a summary conditioned on the predicted sequences. Wiseman et al. (2017)’s model refers to all data records at every time step, while Puduppully et al. (2019)’s model refers to a subset of all data records, which is predicted in the first stage. Unlike these models, our model uses one memory vector $\boldsymbol{h}^{\textsc{Ent}}_{t}$ that tracks the history of the data records during generation. We retrained the baselines on our new dataset. We also present the performance of the Gold and Templates summaries. The Gold summary is identical to the reference summary, and each Templates summary is generated in the same manner as Wiseman et al. (2017).

In the latter half of our experiments, we examine the effect of adding information about writers. In addition to our model enhanced with writer information, we also add writer information to the model by Puduppully et al. (2019). Their method consists of two stages corresponding to content planning and surface realization. Therefore, by incorporating writer information into each of the two stages, we can clearly see which part of the model the writer information contributes to. For the model of Puduppully et al. (2019), we attach the writer information in the following three ways:

1. concatenating writer embedding $\boldsymbol{w}$ with the input vector for LSTM in the content planning decoder (stage 1);

2. concatenating writer embedding $\boldsymbol{w}$ with the input vector for LSTM in the text generator (stage 2);

3. using both 1 and 2 above.

For more details about each decoding stage, readers can refer to Puduppully et al. (2019).

5.3 Evaluation metrics

As evaluation metrics, we use the BLEU score Papineni et al. (2002) and the extractive metrics proposed by Wiseman et al. (2017), i.e., relation generation (RG), content selection (CS), and content ordering (CO). The model for extracting relation tuples was trained on tuples made from the entity (e.g., team name, city name, player name) and attribute value (e.g., “Lakers”, “92”) extracted from the summaries, and the corresponding attributes (e.g., “Team Name”, “Pts”) found in the box- or line-score. The precision and the recall of this extraction model are respectively 93.4% and 75.0% on the test data. The extractive metrics measure how well the relations extracted from the generated summary match the correct relations:

- RG: the ratio of the correct relations out of all the extracted relations, where correct relations are relations found in the input data records $\boldsymbol{x}$ . The average number of extracted relations is also reported.

- CS: precision and recall of the relations extracted from the generated summary against those from the reference summary.

- CO: edit distance measured with normalized Damerau-Levenshtein Distance (DLD) between the sequences of relations extracted from the generated and reference summary.
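Once relations have been extracted, the three extractive metrics reduce to simple set and sequence computations. The sketch below is a simplified illustration and not the official evaluation script: it treats relations as sets for CS, uses the optimal-string-alignment variant of the Damerau-Levenshtein distance, and reports CO as one minus the distance divided by the longer sequence length, all of which are assumptions on our part.

```python
def rg_precision(extracted, records):
    """Share of extracted relations that appear in the input data records."""
    if not extracted:
        return 0.0
    return sum(r in records for r in extracted) / len(extracted)

def cs_scores(generated, reference):
    """Precision, recall, and F1 of generated relations against reference relations."""
    gen, ref = set(generated), set(reference)
    overlap = len(gen & ref)
    p = overlap / len(gen) if gen else 0.0
    r = overlap / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def co_score(generated, reference):
    """Normalized similarity: 1 - DLD / length of the longer sequence."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - dl_distance(generated, reference) / longest

JP = "Jabari Parker"
records = {(JP, "Pts", 15), (JP, "Reb", 4), (JP, "Ast", 3)}
reference = [(JP, "Pts", 15), (JP, "Reb", 4), (JP, "Ast", 3)]
generated = [(JP, "Pts", 15), (JP, "Ast", 3), (JP, "Reb", 5)]  # last relation is wrong

print(rg_precision(generated, records), cs_scores(generated, reference),
      co_score(generated, reference))
```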

6 Results and Discussions

We first focus on the quality of tracking model and entity representation in Sections 6.1 to 6.4, where we use the model without writer information. We examine the effect of writer information in Section 6.5.

6.1 Saliency tracking-based model

As shown in Table 3, our model outperforms all baselines across all evaluation metrics. The scores of Puduppully et al. (2019)’s model dropped significantly from what they reported, especially on the BLEU metric. We speculate this is mainly due to the reduced amount of our training data (Section 3). That is, their model might be more data-hungry than other models.

One noticeable result is that our model achieves slightly higher RG precision than the gold summary. Because the evaluation relies on automatically extracted relations, a generated summary can exceed the gold summary in RG precision. In fact, the template model achieves 100% RG precision.

The other is that only our model exceeds the template model in the F1 score of content selection and obtains the highest performance in content ordering. This implies that the tracking model encourages the selection of salient input records in the correct order.

6.2 Qualitative analysis of entity embedding

Our model has the entity embedding $\bar{\boldsymbol{e}}$, which depends on the box score for each game, in addition to the static entity embedding $\boldsymbol{e}$. We now analyze the difference between these two types of embeddings.

We present two-dimensional visualizations of both embeddings produced using PCA (Pearson, 1901). As shown in Figure 1, which visualizes the static entity embedding $\boldsymbol{e}$, the top-ranked players are located close to one another.

We also present the visualizations of the dynamic entity embeddings $\bar{\boldsymbol{e}}$ in Figure 2. Although we did not carry out feature engineering specific to the NBA (e.g., whether a player scored double digits or not)[8] for representing the dynamic entity embedding $\bar{\boldsymbol{e}}$, the players who performed well in each game have similar embeddings. In addition, the embedding of the same player changes depending on the box score of each game. For instance, LeBron James recorded a double-double in a game on April 22, 2016. For this game, his embedding is located close to the embedding of Kevin Love, who also scored a double-double. However, he did not participate in the game on December 26, 2016. His embedding for this game became closer to those of other players who also did not participate.

[8] In the NBA, a player who accumulates a double-digit score in one of five categories (points, rebounds, assists, steals, and blocked shots) in a game is regarded as a good player. If a player records double digits in two of those five categories, it is referred to as a double-double.

6.3 Duplicate ratios of extracted relations

As Puduppully et al. (2019) pointed out, a generated summary may mention the same relation multiple times. Such duplicated relations are not favorable in terms of the brevity of text.

Figure 3 shows the ratios of the generated summaries with duplicate mentions of relations in the development data. While the models by Wiseman et al. (2017) and Puduppully et al. (2019) respectively showed 36.0% and 15.8% as duplicate ratios, our model exhibited 4.2%. This suggests that our model dramatically suppressed generation of redundant relations. We speculate that the tracking model successfully memorized which input records have been selected in $\boldsymbol{h}_{s}^{\textsc{Ent}}$ .
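As a rough illustration of how such a duplicate ratio could be computed, the sketch below counts the share of summaries in which at least one extracted relation appears more than once. The tuple layout, function names, and toy inputs are hypothetical, and the paper's exact counting procedure may differ.

```python
from collections import Counter


def has_duplicate(relations):
    """True if any extracted relation is mentioned more than once in a summary."""
    return any(count > 1 for count in Counter(relations).values())


def duplicate_ratio(summaries):
    """Share of summaries that contain at least one duplicated relation."""
    if not summaries:
        return 0.0
    return sum(has_duplicate(relations) for relations in summaries) / len(summaries)


# Toy input (hypothetical): each inner list holds the relations of one summary.
a = ("Derrick Rose", "15", "PTS")
b = ("Courtney Lee", "11", "PTS")
summaries = [[a, b, a], [a, b], [b]]
print(duplicate_ratio(summaries))  # only the first summary repeats a relation
```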

6.4 Qualitative analysis of output examples

Figure 5 shows the generated examples from validation inputs with Puduppully et al. (2019)’s model and our model. Whereas both generations seem to be fluent, the summary of Puduppully et al. (2019)’s model includes erroneous relations colored in orange.

Specifically, the description of Derrick Rose’s relations, “15 points, four assists, three rounds and one steal in 33 minutes.”, is also used for other entities (e.g., John Henson and Willy Hernagomez). This is because Puduppully et al. (2019)’s model has no tracking module. Our model, by contrast, has a tracking module that mitigates redundant references, and its output therefore rarely contains erroneous relations.

However, when complicated expressions such as parallel structures are used, our model also generates erroneous relations, as illustrated by the underlined sentences describing the two players who scored the same points. For example, “11-point efforts” is correct for Courtney Lee but not for Derrick Rose. As future work, it is necessary to develop a method that can handle such complicated relations.

6.5 Use of writer information

We first look at the results of an extension of Puduppully et al. (2019)’s model with writer information $\boldsymbol{w}$ in Table 4. By adding $\boldsymbol{w}$ to content planning (stage 1), the method obtained improvements in CS (37.60 to 47.25), CO (16.97 to 22.16), and BLEU score (13.96 to 18.18). By adding $\boldsymbol{w}$ to the component for surface realization (stage 2), the method obtained an improvement in BLEU score (13.96 to 17.81), while the effects on the other metrics were not very significant. By adding $\boldsymbol{w}$ to both stages, the method scored the highest BLEU, while the other metrics were not very different from those obtained by adding $\boldsymbol{w}$ to stage 1. This result suggests that writer information contributes to both content planning and surface realization when it is properly used, and improvements of content planning lead to much better performance in surface realization.

Our model showed improvements in most metrics and showed the best performance by incorporating writer information $\boldsymbol{w}$ . As discussed in Section 4.5, $\boldsymbol{w}$ is supposed to affect both content planning and surface realization. Our experimental result is consistent with the discussion.

7 Conclusion

In this research, we proposed a new data-to-text model that produces a summary text while tracking the salient information, imitating the human writing process. As a result, our model outperformed the existing models in all evaluation measures. We also explored the effects of incorporating writer information into data-to-text models. With writer information, our model generated the highest-quality summaries, achieving a BLEU score of 20.84.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful suggestions. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), JST PRESTO (Grant Number JPMJPR1655), and AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).

Appendix A Algorithm

The generation process of our model is shown in Algorithm 1. For a concise description, we omit the condition for each probability notation. <SoD> and <EoD> represent “start of the document” and “end of the document”, respectively.

Appendix B Experimental settings

We set the dimension of the embeddings to 128 and that of the RNN hidden state to 512; all parameters are initialized with Xavier initialization (Glorot and Bengio, 2010). We set the maximum number of epochs to 30 and choose the model with the highest BLEU score on the development data. The initial learning rate is 2e-3, and AMSGrad (Reddi et al., 2018) is used to adjust the learning rate automatically. Our implementation uses DyNet (Neubig et al., 2017).
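The settings above can be collected in a small configuration object, together with the model-selection rule. This is only a framework-agnostic sketch: the class, field, and function names are hypothetical, it does not use the DyNet API, and breaking ties in favor of the earliest epoch is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    embedding_dim: int = 128
    rnn_hidden_dim: int = 512
    max_epochs: int = 30
    initial_learning_rate: float = 2e-3
    optimizer: str = "amsgrad"
    initializer: str = "xavier"


def select_best_epoch(dev_bleu_per_epoch):
    """Return (1-based epoch, BLEU) with the highest development BLEU.

    Ties are resolved in favor of the earliest epoch (an assumption).
    """
    if not dev_bleu_per_epoch:
        raise ValueError("no development scores were recorded")
    best = max(range(len(dev_bleu_per_epoch)), key=lambda i: dev_bleu_per_epoch[i])
    return best + 1, dev_bleu_per_epoch[best]


config = TrainingConfig()
# Hypothetical development BLEU scores for the first four epochs.
print(select_best_epoch([10.0, 12.5, 12.5, 11.0]))
```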

Bibliography


1. Aoki et al. (2018) Tatsuya Aoki, Akira Miyazawa, Tatsuya Ishigaki, Keiichi Goshima, Kasumi Aoki, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2018. Generating Market Comments Referring to External Resources. In Proceedings of the 11th International Conference on Natural Language Generation, pages 135–139.
2. Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the Third International Conference on Learning Representations.
3. Barzilay and Lapata (2005) Regina Barzilay and Mirella Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338.
4. Bosselut et al. (2018) Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2018. Simulating Action Dynamics with Neural Process Networks. In Proceedings of the Sixth International Conference on Learning Representations.
5. Chen and Mooney (2008) David L Chen and Raymond J Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th international conference on Machine learning, pages 128–135.
6. Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734.
7. Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, Kyung Hyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
8. Clark et al. (2018) Elizabeth Clark, Yangfeng Ji, and Noah A Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 16th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2250–2260.