Enumeration of Extractive Oracle Summaries

Tsutomu Hirao; Masaaki Nishino; Jun Suzuki; Masaaki Nagata

arXiv:1701.01614·cs.CL·January 9, 2017

TL;DR

This paper introduces an ILP-based method to enumerate all extractive oracle summaries, revealing potential for improving summarization performance and better aligning automatic metrics with human judgment.

Contribution

It presents a novel ILP formulation and enumeration algorithm for extractive oracle summaries, enhancing analysis of summarization quality and evaluation.

Findings

01

Enumerated oracle summaries correlate better with human judgment.

02

Room for improvement in extractive summarization performance.

03

F-measures from enumeration outperform single oracle summaries.

Abstract

To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

Tables

Table 1: Statistics of data set

Year	Topics	Docs.	Sents.	Words	Refs.	Length
01	30	10	365	7706	89	100
02	59	10	238	4822	116	100
03	30	10	245	5711	120	100
04	50	10	218	4870	200	100
05	50	29.5	885	18273.5	300	250
06	50	25	732.5	15997.5	200	250
07	45	25	516	11427	180	250

Table 2: $\text{\sc Rouge}_{1,2}$ scores of oracle summaries, greedy summaries, and system summaries for each data set

	01		02		03		04		05		06		07
	R₁	R₂	R₁	R₂	R₁	R₂	R₁	R₂	R₁	R₂	R₁	R₂	R₁	R₂
Oracle (multi)	.400	.164	.452	.186	.434	.185	.427	.162	.445	.177	.491	.211	.506	.236
Oracle (single)	.500	.226	.515	.225	.525	.258	.519	.228	.574	.279	.607	.303	.622	.330
Greedy	.387	.161	.438	.184	.424	.182	.412	.157	.430	.173	.473	.206	.495	.234
Peer	.251	.080	.269	.080	.295	.094	.305	.092	.262	.073	.305	.095	.363	.117
ID	T	T	19	19	26	13	67	65	10	15	23	24	29	15

Table 3: $\text{\sc Rouge}_{1,2}$ scores for state-of-the-art summarization systems on DUC-2004 corpus

System	$\text{\sc Rouge}_{1}$	$\text{\sc Rouge}_{2}$
Oracle (multi)	.427	.162
Oracle (single)	.519	.228
CLASSY04	.305	.0897
CLASSY11	.286	.0919
Submodular	.300	.0933
DPP	.309	.0960
RegSum	.331	.0974
OCCAMS_V	.300	.0974
ICSISumm	.310	.0980

Table 4: Jaccard Index between both oracle and greedy summaries

	single	multi
Rouge₁	.451	.419
Rouge₂	.536	.530

Table 5: Median number of oracle summaries and rates of reference summaries and topics with multiple oracle summaries for each data set

	Median				Rate
	single		multi		single		multi
	Rouge₁	Rouge₂	Rouge₁	Rouge₂	Rouge₁	Rouge₂	Rouge₁	Rouge₂
01	8	9	4	5	.854	.787	.833	.733
02	7.5	5.5	4	4	.897	.836	.814	.780
03	8	10.5	3.5	4	.833	.858	.800	.900
04	8	8	3.5	3	.865	.865	.780	.760
05	35	35.5	2	3	.916	.907	.580	.660
06	28	22	2.5	3	.877	.880	.700	.720
07	23	16	4	2	.910	.878	.733	.711

Table 6: Correlation coefficients between automatic evaluations and human judgments on DUC-2004 corpus

Metric	$r$	$\rho$
$\text{\sc Rouge}_{1}$	.861	.760
$\text{\sc Rouge}_{2}$	.907	.831
F-measure (R₁) (single-M)	.857	.855
F-measure (R₁) (single-S)	.815-.830	.811-.830
F-measure (R₂) (single-M)	.904	.826
F-measure (R₂) (single-S)	.855-.865	.740-.760
F-measure (R₁) (multi-M)	.814	.841
F-measure (R₁) (multi-S)	.794-.802	.803-.813
F-measure (R₂) (multi-M)	.824	.846
F-measure (R₂) (multi-S)	.806-.816	.797-.817

Table 7: Median number of summaries checked by each search method

	Rouge₁		Rouge₂
	Naive	Proposed	Naive	Proposed
01	3.66 $\times 10^{13}$	5.75 $\times 10^{3}$	3.32 $\times 10^{7}$	1.00 $\times 10^{3}$
02	1.12 $\times 10^{12}$	4.64 $\times 10^{3}$	1.34 $\times 10^{7}$	8.87 $\times 10^{2}$
03	1.62 $\times 10^{11}$	3.65 $\times 10^{3}$	6.37 $\times 10^{6}$	8.19 $\times 10^{2}$
04	9.65 $\times 10^{10}$	4.47 $\times 10^{3}$	6.90 $\times 10^{6}$	9.83 $\times 10^{2}$
05	5.48 $\times 10^{36}$	2.32 $\times 10^{6}$	3.48 $\times 10^{21}$	7.03 $\times 10^{4}$
06	1.94 $\times 10^{32}$	1.97 $\times 10^{6}$	2.11 $\times 10^{20}$	5.08 $\times 10^{4}$
07	4.14 $\times 10^{28}$	1.40 $\times 10^{6}$	1.81 $\times 10^{19}$	2.60 $\times 10^{4}$

Equations

$$\text{\sc Rouge}_{n}(\boldsymbol{R},S)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}\min\{N(g_{j}^{n},{\cal R}_{k}),N(g_{j}^{n},{\cal S})\}}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}N(g_{j}^{n},{\cal R}_{k})}. \qquad (1)$$

$$O=\mathop{\rm arg\,max}_{S\subseteq D}\ \text{\sc Rouge}_{n}(\boldsymbol{R},S)\quad\text{s.t.}\quad\ell(S)\leq L_{\rm max}. \qquad (2)$$

$$\mathop{\rm maximize}_{z}\ \sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}z_{kj} \qquad (3)$$

$$\text{s.t.}\quad\sum_{i=1}^{|D|}\ell(s_{i})x_{i}\leq L_{\rm max} \qquad (4)$$

$$\forall j:\ \sum_{i=1}^{|D|}N(g_{j}^{n},s_{i})x_{i}\geq z_{kj} \qquad (5)$$

$$\forall j:\ N(g_{j}^{n},{\cal R}_{k})\geq z_{kj} \qquad (6)$$

$$\forall i:\ x_{i}\in\{0,1\} \qquad (7)$$

$$\forall j:\ z_{kj}\in\mathbb{Z}_{+}. \qquad (8)$$

$$\text{\sc Rouge}_{n}(\boldsymbol{R},V\cup W)=\text{\sc Rouge}_{n}(\boldsymbol{R},V)+\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,W). \qquad (9)$$

$$\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,W)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}\min\{N(t_{n},{\cal R}_{k}\setminus{\cal V}),N(t_{n},{\cal W})\}}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}N(t_{n},{\cal R}_{k})}. \qquad (10)$$

$$\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)=\text{\sc Rouge}_{n}(\boldsymbol{R},V)+\max_{\Omega\subseteq W}\{\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\Omega):\ell(\Omega)\leq L_{\rm max}-\ell(V)\}.$$

$$\max_{\Omega\subseteq W}\{\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\Omega):\ell(\Omega)\leq L_{\rm max}-\ell(V)\}\leq\max_{\mathbf{x}}\left\{\sum_{i=1}^{|W|}\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\{w_{i}\})x_{i}:\sum_{i=1}^{|W|}\ell(\{w_{i}\})x_{i}\leq L_{\rm max}-\ell(V)\right\}.$$

$$\text{\sc Rouge}_{n}(\boldsymbol{R},V)+\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,W)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}f(t_{n},{\cal R}_{k},{\cal V},{\cal W})}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}N(t_{n},{\cal R}_{k})}.$$

$$f(t_{n},{\cal R}_{k},{\cal V},{\cal W})=\min\{N(t_{n},{\cal R}_{k}),N(t_{n},{\cal V})\}+\min\{N(t_{n},{\cal R}_{k}\setminus{\cal V}),N(t_{n},{\cal W})\}.$$

$$f(t_{n},{\cal R}_{k},{\cal V},{\cal W})=\min\{N(t_{n},{\cal R}_{k}),N(t_{n},{\cal V})\}+\min\{\max\{N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V}),0\},N(t_{n},{\cal W})\}.$$

$$f(t_{n},{\cal R}_{k},{\cal V},{\cal W})=\ \ldots$$

$$\text{\sc Rouge}_{n}(\boldsymbol{R},V\cup W)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}\min\{N(t_{n},{\cal R}_{k}),N(t_{n},{\cal V})+N(t_{n},{\cal W})\}}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}N(t_{n},{\cal R}_{k})}$$

$$C[0][j]=0\quad(0\leq j\leq L_{\rm max})$$

$$C[i][j]=\begin{cases}C[i-1][j]+C[i-1][j-\ell(s_{i})]&\text{if }j-\ell(s_{i})\geq 0\\ C[i-1][j]&\text{otherwise}\end{cases}$$

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques

Full text

Enumeration of Extractive Oracle Summaries

Tsutomu Hirao

Masaaki Nishino

Jun Suzuki

Masaaki Nagata

NTT Communication Science Laboratories, NTT Corporation

2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, Japan

{hirao.tsutomu,nishino.masaaki}@lab.ntt.co.jp

{suzuki.jun,nagata.masaaki}@lab.ntt.co.jp

Abstract

To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of $\text{\sc Rouge}_{n}$ . We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

1 Introduction

Recently, compressive and abstractive summarization have been attracting attention (e.g., [?], [?], [?], [?], [?]). However, extractive summarization remains a primary research topic because the linguistic quality of the resultant summaries is guaranteed, at least at the sentence level, which is a key requirement for practical use (e.g., [?], [?], [?], [?]).

The summarization research community is experiencing a paradigm shift from extractive to compressive or abstractive summarization. Currently our question is: “Is extractive summarization still useful research?” To answer it, the ultimate limitations of the extractive summarization paradigm must be comprehended; that is, we have to determine its upper bound and compare it with the performance of the state-of-the-art summarization methods. Since $\text{\sc Rouge}_{n}$ is the de facto automatic evaluation method and is employed in many text summarization studies, an oracle summary is defined as a set of sentences that have a maximum $\text{\sc Rouge}_{n}$ score. If the $\text{\sc Rouge}_{n}$ score of an oracle summary outperforms that of a system that employs another summarization approach, the extractive summarization paradigm is still worth the investment of research resources.

As another benefit, identifying an oracle summary for a set of reference summaries allows us to utilize yet another evaluation measure. Since both oracle and extractive summaries are sets of sentences, it is easy to check whether a system summary contains sentences in the oracle summary. As a result, F-measures, which are available to evaluate a system summary, are useful for evaluating classification-based extractive summarization [Mani and Bloedorn, 1998, Osborne, 2002, Hirao et al., 2002]. Since $\text{\sc Rouge}_{n}$ evaluation does not identify which sentence is important, an F-measure conveys useful information in terms of “important sentence extraction.” Thus, combining $\text{\sc Rouge}_{n}$ and an F-measure allows us to scrutinize the failure analysis of systems.

Note that more than one oracle summary might exist for a set of reference summaries because $\text{\sc Rouge}_{n}$ scores are based on the unweighted counting of n-grams. As a result, an F-measure might not be identical among multiple oracle summaries. Thus, we need to enumerate the oracle summaries for a set of reference summaries and compute the F-measures based on them.

In this paper, we first derive an Integer Linear Programming (ILP) problem to extract an oracle summary from a set of reference summaries and a source document(s). To the best of our knowledge, this is the first ILP formulation that extracts oracle summaries. Second, since it is difficult to enumerate oracle summaries for a set of reference summaries using ILP solvers, we propose an algorithm that efficiently enumerates all oracle summaries by exploiting the branch and bound technique. Our experimental results on the Document Understanding Conference (DUC) corpora showed the following:

1. Room still exists for the further improvement of extractive summarization, i.e., the $\text{\sc Rouge}_{n}$ scores of the oracle summaries are significantly higher than those of the state-of-the-art summarization systems.

2. The F-measures derived from multiple oracle summaries obtain significantly stronger correlations with human judgment than those derived from single oracle summaries.

2 Definition of Extractive Oracle Summaries

We first briefly describe $\text{\sc Rouge}_{n}$ . Given set of reference summaries $\boldsymbol{R}$ and system summary $S$ , $\text{\sc Rouge}_{n}$ is defined as follows:

$$\text{\sc Rouge}_{n}(\boldsymbol{R},S)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}\min\{N(g_{j}^{n},{\cal R}_{k}),N(g_{j}^{n},{\cal S})\}}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}N(g_{j}^{n},{\cal R}_{k})}. \qquad (1)$$

${\cal R}_{k}$ denotes the multiset of n-grams that occur in the $k$-th reference summary $R_{k}$, and ${\cal S}$ denotes the multiset of n-grams that appear in system-generated summary $S$ (a set of sentences). $N(g_{j}^{n},{\cal R}_{k})$ and $N(g_{j}^{n},{\cal S})$ return the number of occurrences of n-gram $g_{j}^{n}$ in the $k$-th reference and system summaries, respectively. Function $U(\cdot)$ transforms a multiset into an ordinary set. $\text{\sc Rouge}_{n}$ takes values in the range of $[0,1]$, and when the n-gram occurrences of the system summary agree with those of the reference summary, the value is 1.
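The definition of $\text{\sc Rouge}_{n}$ above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `ngrams` and `rouge_n` are hypothetical, inputs are already tokenized (no stemming or stopword handling), and n-grams are counted within each sentence, which matches the per-sentence counts $N(g_{j}^{n},s_{i})$ used later in the ILP.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in one token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(references, summary, n):
    """ROUGE-n of `summary` (a list of tokenized sentences) against
    `references` (a list of tokenized reference summaries)."""
    system = Counter()
    for sentence in summary:
        system += ngrams(sentence, n)  # assumption: n-grams stay inside a sentence
    matched = total = 0
    for reference in references:
        ref = ngrams(reference, n)
        total += sum(ref.values())
        # Clipped count: an n-gram is credited at most as often as the reference has it.
        matched += sum(min(count, system[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0


references = [["the", "cat", "sat"]]
summary = [["the", "cat"]]
print(rouge_n(references, summary, 1))  # 2 of 3 reference unigrams are covered
print(rouge_n(references, summary, 2))  # 1 of 2 reference bigrams is covered
```

The `min` inside the sum is what makes the measure recall-oriented and clipped: repeating a matching n-gram in the summary beyond its reference count earns nothing.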

In this paper, we focus on extractive summarization, employ $\text{\sc Rouge}_{n}$ as an evaluation measure, and define the oracle summaries as follows:

$$O=\mathop{\rm arg\,max}_{S\subseteq D}\ \text{\sc Rouge}_{n}(\boldsymbol{R},S)\quad\text{s.t.}\quad\ell(S)\leq L_{\rm max}. \qquad (2)$$

$D$ is the set of all the sentences contained in the input document(s), and $L_{\rm max}$ is the length limitation of the oracle summary. $\ell(S)$ indicates the number of words in summary $S$. Eq. (2) is an NP-hard combinatorial optimization problem, so no polynomial-time algorithm is known that attains an optimal solution.
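To make the oracle definition concrete, the following sketch enumerates every subset of a tiny sentence set and keeps all subsets that reach the maximum score within the length limit. It is a naive exhaustive search, feasible only for a handful of sentences, and the names `naive_oracles`, `ngrams`, and `rouge_n` are hypothetical helpers rather than anything defined in the paper.

```python
from collections import Counter
from itertools import combinations
from math import isclose


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(references, summary, n):
    system = Counter()
    for sentence in summary:
        system += ngrams(sentence, n)
    matched = total = 0
    for reference in references:
        ref = ngrams(reference, n)
        total += sum(ref.values())
        matched += sum(min(c, system[g]) for g, c in ref.items())
    return matched / total if total else 0.0


def naive_oracles(references, sentences, n, max_len):
    """Return (best score, all index subsets attaining it) by brute force."""
    best, oracles = 0.0, []
    for size in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), size):
            if sum(len(sentences[i]) for i in subset) > max_len:
                continue  # violates the length limit
            score = rouge_n(references, [sentences[i] for i in subset], n)
            if isclose(score, best):
                oracles.append(subset)
            elif score > best:
                best, oracles = score, [subset]
    return best, oracles


references = [["a", "b", "c", "d"]]
sentences = [["a", "b"], ["c", "d"], ["a", "c"], ["b", "d"]]
best, oracles = naive_oracles(references, sentences, n=1, max_len=4)
print(best, oracles)
```

In this toy input two different sentence pairs cover the reference equally well, so the search returns two oracle summaries with the same score.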

3 Related Work

[?] utilized a naive exhaustive search method to obtain oracle summaries in terms of $\text{\sc Rouge}_{n}$ and exploited them to understand the limitations of extractive summarization systems. [?] proposed another naive exhaustive search method to derive a probability density function from the $\text{\sc Rouge}_{n}$ scores of oracle summaries for the domains to which source documents belong. The computational complexity of naive exhaustive methods is exponential in the size of the sentence set. Thus, it may be possible to apply them to single document summarization tasks involving a dozen sentences, but it is infeasible to apply them to multiple document summarization tasks that involve several hundred sentences.

To describe the difference between the $\text{\sc Rouge}_{n}$ scores of oracle and system summaries in multiple document summarization tasks, [?] proposed an approximate algorithm with a genetic algorithm (GA) to find oracle summaries. [?] utilized a greedy algorithm for the same purpose. Although GA or greedy algorithms are widely used to solve NP-hard combinatorial optimization problems, the solutions are not always optimal. Thus, the summary does not always have a maximum $\text{\sc Rouge}_{n}$ score for the set of reference summaries. Both works called the summary found by their methods the oracle, but it differs from the definition in our paper.

Since summarization systems cannot reproduce human-made reference summaries in most cases, oracle summaries, which can be reproduced by summarization systems, have been used as training data to tune the parameters of summarization systems. For example, [?] and [?] trained their summarizers with oracle summaries found by a greedy algorithm. [?] proposed a method to find a summary that approximates a $\text{\sc Rouge}$ score based on the $\text{\sc Rouge}$ scores of individual sentences and exploited the framework to train their summarizer. As mentioned above, such summaries do not always agree with the oracle summaries defined in our paper. Thus, the quality of the training data is suspect. Moreover, since these studies fail to consider that a set of reference summaries has multiple oracle summaries, the score of the loss function defined between their oracle and system summaries is not appropriate in most cases.

As mentioned above, no known efficient algorithm can extract “exact” oracle summaries as defined in Eq. (2); only a naive exhaustive search is available. Thus, approximate algorithms such as a greedy algorithm are mainly employed to obtain them.

4 Oracle Summary Extraction as an Integer Linear Programming (ILP) Problem

To extract an oracle summary from document(s) and a given set of reference summaries, we start by deriving an Integer Linear Programming (ILP) problem. Since the denominator of Eq. (1) is constant for a given set of reference summaries, we can find an oracle summary by maximizing the numerator of Eq. (1). Thus, the ILP formulation is defined as follows:

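$$\mathop{\rm maximize}_{z}\ \sum_{k=1}^{|\boldsymbol{R}|}\sum_{j=1}^{|U({\cal R}_{k})|}z_{kj} \qquad (3)$$

$$\text{s.t.}\quad\sum_{i=1}^{|D|}\ell(s_{i})x_{i}\leq L_{\rm max} \qquad (4)$$

$$\forall j:\ \sum_{i=1}^{|D|}N(g_{j}^{n},s_{i})x_{i}\geq z_{kj} \qquad (5)$$

$$\forall j:\ N(g_{j}^{n},{\cal R}_{k})\geq z_{kj} \qquad (6)$$

$$\forall i:\ x_{i}\in\{0,1\} \qquad (7)$$

$$\forall j:\ z_{kj}\in\mathbb{Z}_{+}. \qquad (8)$$

Here, $z_{kj}$ is the count of the $j$-th n-gram of the $k$-th reference summary in the oracle summary, i.e., $z_{kj}=\min\{N(g_{j}^{n},{\cal R}_{k}),N(g_{j}^{n},{\cal S})\}$. $\ell(\cdot)$ returns the number of words in the sentence, $x_{i}$ is a binary indicator, and $x_{i}=1$ denotes that the $i$-th sentence $s_{i}$ is included in the oracle summary. $N(g_{j}^{n},s_{i})$ returns the number of occurrences of n-gram $g_{j}^{n}$ in the $i$-th sentence. Constraints (5) and (6) bound $z_{kj}$ from above by both counts, and because the objective maximizes the sum of the $z_{kj}$, they ensure that $z_{kj}=\min\{N(g_{j}^{n},{\cal R}_{k}),N(g_{j}^{n},{\cal S})\}$ at the optimum.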

5 Branch and Bound Technique for Enumerating Oracle Summaries

Since enumerating oracle summaries with an ILP solver is difficult, we extend the exhaustive search approach by introducing a search and prune technique to enumerate the oracle summaries. The search pruning decision is made by comparing the current upper bound of the $\text{\sc Rouge}_{n}$ score with the maximum $\text{\sc Rouge}_{n}$ score in the search history.

5.1 $\text{\sc Rouge}_{n}$ Score for Two Distinct Sets of Sentences

The enumeration of oracle summaries can be regarded as a depth-first search on a tree whose nodes represent sentences. Fig. 1 shows an example of a search tree created in a naive exhaustive search. The nodes represent sentences and the path from the root node to an arbitrary node represents a summary. For example, the red path in Fig. 1 from the root node to node $s_{2}$ represents a summary consisting of sentences $s_{1},s_{2}$ . By utilizing the tree, we can enumerate oracle summaries by exploiting depth-first searches while excluding the summaries that violate length constraints. However, this naive exhaustive search approach is impractical for large data sets because the number of nodes inside the tree is $2^{|D|}$ .

If we prune the unwarranted subtrees in each step of the depth-first search, we can make the search more efficient. The decision to search or prune is made by comparing the current upper bound of the $\text{\sc Rouge}_{n}$ score with the maximum $\text{\sc Rouge}_{n}$ score in the search history. For instance, in Fig. 1, we reach node $s_{2}$ by following this path: “Root $\rightarrow$ $s_{1}$ $\rightarrow$ $s_{2}$”. If we estimate the maximum $\text{\sc Rouge}_{n}$ score (upper bound) obtained by searching for the descendant of $s_{2}$ (the subtree in the blue rectangle), we can decide whether the depth-first search should be continued. When the upper bound of the $\text{\sc Rouge}_{n}$ score is at least the current maximum $\text{\sc Rouge}_{n}$ score in the search history, we have to continue. When the upper bound is smaller than the current maximum $\text{\sc Rouge}_{n}$ score, no summary that contains $s_{1}$ and $s_{2}$ is optimal, so we can skip subsequent search activity on the subtree and proceed to check the next branch: “Root $\rightarrow$ $s_{1}$ $\rightarrow$ $s_{3}$”.

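To estimate the upper bound of the $\text{\sc Rouge}_{n}$ score, we re-define it for two disjoint sets of sentences, $V$ and $W$, i.e., $V\cap W=\emptyset$, as follows:

$$\text{\sc Rouge}_{n}(\boldsymbol{R},V\cup W)=\text{\sc Rouge}_{n}(\boldsymbol{R},V)+\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,W). \qquad (9)$$

Here $\text{\sc Rouge}^{\prime}_{n}$ is defined as follows:

$$\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,W)=\frac{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}\min\{N(t_{n},{\cal R}_{k}\setminus{\cal V}),N(t_{n},{\cal W})\}}{\sum_{k=1}^{|\boldsymbol{R}|}\sum_{t_{n}\in U({\cal R}_{k})}N(t_{n},{\cal R}_{k})}. \qquad (10)$$

${\cal V}$ and ${\cal W}$ are the multisets of n-grams found in the sets of sentences $V$ and $W$, respectively.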

Theorem 1.

Eq. (9) is correct.

Proof.

See Appendix A. ∎
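The decomposition in Eq. (9) can be checked numerically on a small example. The sketch below is an illustration only: `rouge_prime` is a hypothetical name for $\text{\sc Rouge}^{\prime}_{n}$, and the multiset difference ${\cal R}_{k}\setminus{\cal V}$ is realized with Python's `Counter` subtraction, which drops non-positive counts.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def counts(sentences, n):
    """Multiset of n-grams over a set of sentences."""
    total = Counter()
    for sentence in sentences:
        total += ngrams(sentence, n)
    return total


def rouge_n(references, summary, n):
    system = counts(summary, n)
    matched = total = 0
    for reference in references:
        ref = ngrams(reference, n)
        total += sum(ref.values())
        matched += sum(min(c, system[g]) for g, c in ref.items())
    return matched / total


def rouge_prime(references, v, w, n):
    """Gain from adding sentences w once sentences v are already selected."""
    v_counts, w_counts = counts(v, n), counts(w, n)
    gained = total = 0
    for reference in references:
        ref = ngrams(reference, n)
        total += sum(ref.values())
        residual = ref - v_counts  # reference n-grams not yet covered by v
        gained += sum(min(c, w_counts[g]) for g, c in residual.items())
    return gained / total


references = [["a", "b", "a", "c"]]
v = [["a", "b"]]
w = [["a", "a", "d"], ["c"]]
lhs = rouge_n(references, v + w, 1)
rhs = rouge_n(references, v, 1) + rouge_prime(references, v, w, 1)
print(lhs, rhs)
```

Both sides evaluate to the same score here: $V$ covers one "a" and the "b", and $W$ supplies the remaining "a" and the "c".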

5.2 Upper Bound of $\text{\sc Rouge}_{n}$

Let $V$ be the set of sentences on the path from the current node to the root node in the search tree, and let $W$ be the set of sentences that are the descendants of the current node. In Fig. 1, $V{=}\{s_{1},s_{2}\}$ and $W{=}\{s_{3},s_{4},s_{5},s_{6}\}$ . According to Theorem 1, the upper bound of the $\text{\sc Rouge}_{n}$ score is defined as:

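$$\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)=\text{\sc Rouge}_{n}(\boldsymbol{R},V)+\max_{\Omega\subseteq W}\{\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\Omega):\ell(\Omega)\leq L_{\rm max}-\ell(V)\}.$$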

Since the second term on the right side in Eq. (5.2) is an NP-hard problem, we turn to the following relation by introducing inequality, $\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\Omega)\leq\sum_{\omega\in\Omega}\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\{\omega\})$ ,

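$$\max_{\Omega\subseteq W}\{\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\Omega):\ell(\Omega)\leq L_{\rm max}-\ell(V)\}\leq\max_{\mathbf{x}}\left\{\sum_{i=1}^{|W|}\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\{w_{i}\})x_{i}:\sum_{i=1}^{|W|}\ell(\{w_{i}\})x_{i}\leq L_{\rm max}-\ell(V)\right\}.$$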

Here, $\mathbf{x}=(x_{1},\ldots,x_{|W|})$ and $x_{i}\in\{0,1\}$. The right side of Eq. (5.2) is a knapsack problem, i.e., a 0-1 ILP problem. Although we can obtain the optimal solution for it using dynamic programming or ILP solvers, we solve its linear programming relaxation version by applying a greedy algorithm for greater computation efficiency. The solution output by the greedy algorithm is optimal for the relaxed problem. Since the optimal value of the relaxed problem is never smaller than that of the original problem, the relaxed problem solution can be utilized as the upper bound. Algorithm 1 shows the pseudocode that attains the upper bound of $\text{\sc Rouge}_{n}$. In the algorithm, $U$ indicates the upper bound score of $\text{\sc Rouge}_{n}$. We first set the initial score of upper bound $U$ to $\text{\sc Rouge}_{n}(\boldsymbol{R},V)$ (line 3). Then we compute the density of the $\text{\sc Rouge}^{\prime}_{n}$ scores ($\text{\sc Rouge}_{n}^{\prime}(\boldsymbol{R},V,\{w\})/\ell(w)$) for each sentence $w$ in $W$ and sort them in descending order (lines 4 to 6). When we have room to add $w$ to the summary, we update $U$ by adding $\text{\sc Rouge}^{\prime}_{n}(\boldsymbol{R},V,\{w\})$ (line 10) and reduce the remaining length $L_{\rm max}$ accordingly (line 11). When we do not have room to add $w$, we update $U$ by adding the score obtained by multiplying the density of $w$ by the remaining length, $L_{\rm max}$ (line 13), and exit the while loop.
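The upper-bound computation described for Algorithm 1 can be sketched as a greedy solution of the relaxed knapsack. The pseudocode itself is not reproduced here, so this is a reading of the prose rather than the authors' implementation; `upper_bound`, `rouge_prime`, and the other helper names are hypothetical, and sentences are assumed to be non-empty token lists.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def counts(sentences, n):
    total = Counter()
    for sentence in sentences:
        total += ngrams(sentence, n)
    return total


def rouge_n(references, summary, n):
    system = counts(summary, n)
    refs = [ngrams(r, n) for r in references]
    total = sum(sum(ref.values()) for ref in refs)
    return sum(min(c, system[g]) for ref in refs for g, c in ref.items()) / total


def rouge_prime(references, v, w, n):
    v_counts, w_counts = counts(v, n), counts(w, n)
    refs = [ngrams(r, n) for r in references]
    total = sum(sum(ref.values()) for ref in refs)
    return sum(min(c, w_counts[g])
               for ref in refs for g, c in (ref - v_counts).items()) / total


def upper_bound(references, v, w, n, max_len):
    """Greedy solution of the LP-relaxed knapsack over the sentences in w."""
    bound = rouge_n(references, v, n)
    room = max_len - sum(len(s) for s in v)
    # (gain of the sentence alone, its length), sorted by gain per word.
    items = [(rouge_prime(references, v, [s], n), len(s)) for s in w]
    items.sort(key=lambda item: item[0] / item[1], reverse=True)
    for gain, length in items:
        if length <= room:
            bound += gain
            room -= length
        else:
            bound += gain / length * room  # fractional share of the next sentence
            break
    return bound


references = [["a", "b", "c", "d"]]
v = [["a", "b"]]
w = [["c", "d"], ["a", "c"], ["b", "d"]]
print(upper_bound(references, v, w, 1, max_len=4))
print(upper_bound(references, v, w, 1, max_len=3))
```

With `max_len=3` no sentence in `w` fits, so the exact best score stays at that of `v` alone, while the bound still credits a fractional sentence. The bound is loose in that case but never below the true maximum, which is all pruning requires.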

5.3 Initial Score for Search

Since the branch and bound technique prunes the search by comparing the best solution found so far with the upper bounds, obtaining a good solution in the early stage is critical for raising search efficiency.

Since $\text{\sc Rouge}_{n}$ is a monotone submodular function [Lin and Bilmes, 2011], we can obtain a good approximate solution by a greedy algorithm [Khuller et al., 1999]. It is guaranteed that the score of the obtained approximate solution is at least $\frac{1}{2}(1-\frac{1}{e})\text{OPT}$, where OPT is the score of the optimal solution. We employ the solution as the initial $\text{\sc Rouge}_{n}$ score of the candidate oracle summary.

Algorithm 2 shows the greedy algorithm. In it, $S$ denotes a summary and $D$ denotes a set of sentences. The algorithm iteratively adds sentence $s^{*}$ that yields the largest gain in the $\text{\sc Rouge}_{n}$ score to current summary $S$ , provided the length of the summary does not violate length constraint $L_{\rm max}$ (line 4). After the while loop, the algorithm compares the $\text{\sc Rouge}_{n}$ score of $S$ with the maximum $\text{\sc Rouge}_{n}$ score of the single sentence and outputs the larger of the two scores (lines 11 to 13).
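A minimal sketch of the greedy procedure described above follows. The text does not say whether the gain is normalized by sentence length, so this version uses the raw gain in $\text{\sc Rouge}_{n}$; the name `greedy_summary` and the tie-breaking by input order are assumptions, not part of the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(references, summary, n):
    system = Counter()
    for sentence in summary:
        system += ngrams(sentence, n)
    matched = total = 0
    for reference in references:
        ref = ngrams(reference, n)
        total += sum(ref.values())
        matched += sum(min(c, system[g]) for g, c in ref.items())
    return matched / total if total else 0.0


def greedy_summary(references, sentences, n, max_len):
    """Return indices of a greedily built summary within the length limit."""
    def score(indices):
        return rouge_n(references, [sentences[i] for i in indices], n)

    chosen, length = [], 0
    candidates = list(range(len(sentences)))
    while candidates:
        base = score(chosen)
        # Sentence with the largest gain over the current summary.
        best = max(candidates, key=lambda i: score(chosen + [i]) - base)
        if length + len(sentences[best]) <= max_len:
            chosen.append(best)
            length += len(sentences[best])
        candidates.remove(best)  # considered once, whether or not it fit
    # Compare with the best single sentence that fits on its own.
    singles = [[i] for i in range(len(sentences)) if len(sentences[i]) <= max_len]
    best_single = max(singles, key=score, default=[])
    return chosen if score(chosen) >= score(best_single) else best_single


references = [["a", "b", "c", "d"]]
sentences = [["a", "b"], ["c", "d"], ["a", "c"], ["b", "d"]]
picked = greedy_summary(references, sentences, n=1, max_len=4)
print(picked, rouge_n(references, [sentences[i] for i in picked], 1))
```

The returned score can then serve as the initial threshold for the branch and bound search, so that weak branches are pruned from the first step.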

5.4 Enumeration of Oracle summaries

By introducing threshold $\tau$ as the best $\text{\sc Rouge}_{n}$ score in the search history, pruning decisions involve the following three conditions:

1. $\text{\sc Rouge}_{n}(\boldsymbol{R},V)\geq\tau$;

2. $\text{\sc Rouge}_{n}(\boldsymbol{R},V)<\tau$, $\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)<\tau$;

3. $\text{\sc Rouge}_{n}(\boldsymbol{R},V)<\tau$, $\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)\geq\tau$.

With case 1, we update the oracle summary as $V$ and continue the search. With case 2, because both $\text{\sc Rouge}_{n}(\boldsymbol{R},V)$ and $\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)$ are smaller than $\tau$ , the subtree whose root node is the current node (last visited node) is pruned from the search space, and we continue the depth-first search from the neighbor node. With case 3, we do not update oracle summary as $V$ because $\text{\sc Rouge}_{n}(\boldsymbol{R},V)$ is less than $\tau$ . However, we might obtain a better oracle summary by continuing the depth-first search because the upper bound of the $\text{\sc Rouge}_{n}$ score exceeds $\tau$ . Thus, we continue to search for the descendants of the current node.

Algorithm 3 shows the pseudocode that enumerates the oracle summaries. The algorithm reads a set of reference summaries $\boldsymbol{R}$, length limitation $L_{\rm max}$, and set of sentences $D$ (line 1) and initializes threshold $\tau$ as the $\text{\sc Rouge}_{n}$ score obtained by the greedy algorithm (Algorithm 2). It also initializes $O_{\tau}$, which stores oracle summaries whose $\text{\sc Rouge}_{n}$ scores are $\tau$, and priority queue $C$, which stores the history of the depth-first search (line 2). Next, the algorithm computes the $\text{\sc Rouge}_{n}$ score for each sentence and stores $S$ after sorting them in descending order. After that, we start a depth-first search by recursively calling procedure FindOracle. In the procedure, we extract the top sentence from priority queue $Q$ and append it to priority queue $V$ (lines 11 to 12). When the length of $V$ is less than $L_{\rm max}$, if $\text{\sc Rouge}_{n}(\boldsymbol{R},V)$ is at least threshold $\tau$ (case 1), we update $\tau$ as the score and append current $V$ to $O_{\tau}$. Then we continue the depth-first search by calling procedure FindOracle (lines 15 to 17). If $\widehat{\text{\sc Rouge}_{n}}(\boldsymbol{R},V)$ is at least $\tau$ (case 3), we do not update $\tau$ and $O_{\tau}$ but reenter the depth-first search by calling the procedure again (lines 18 to 19). If neither case 1 nor case 3 is true, we delete the last visited sentence from $V$ and return to the top of the recursion.
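The three pruning cases can be put together in a compact depth-first search. The sketch below is not the authors' Algorithm 3: it uses plain recursion instead of priority queues, starts the threshold $\tau$ at 0 instead of a greedy score, and works with exact integer match counts and `Fraction` so that ties are detected reliably. All function names are hypothetical, and sentences and references are assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matched(ref_counts, system):
    """Numerator of ROUGE-n: clipped n-gram matches summed over references."""
    return sum(min(c, system[g]) for ref in ref_counts for g, c in ref.items())


def enumerate_oracles(references, sentences, n, max_len):
    refs = [ngrams(r, n) for r in references]
    sents = [ngrams(s, n) for s in sentences]
    # Visit sentences in descending order of their individual scores.
    order = sorted(range(len(sentences)), key=lambda i: -matched(refs, sents[i]))
    state = {"tau": 0, "oracles": []}  # tau starts at 0 here, not at a greedy score

    def upper_bound(system, score, room, rest):
        residual = [ref - system for ref in refs]  # n-grams V has not covered yet
        items = sorted(((Fraction(matched(residual, sents[i]), len(sentences[i])),
                         len(sentences[i])) for i in rest), reverse=True)
        bound = Fraction(score)
        for density, length in items:
            if length <= room:
                bound += density * length
                room -= length
            else:
                bound += density * room  # fractional part of the relaxed knapsack
                break
        return bound

    def find_oracle(start, chosen, system, length):
        for pos in range(start, len(order)):
            i = order[pos]
            new_length = length + len(sentences[i])
            if new_length > max_len:
                continue
            v, new_system = chosen + [i], system + sents[i]
            score = matched(refs, new_system)
            if score > state["tau"]:        # case 1, strictly better: reset the list
                state["tau"], state["oracles"] = score, [sorted(v)]
            elif score == state["tau"]:     # case 1, tie: one more oracle summary
                state["oracles"].append(sorted(v))
            bound = upper_bound(new_system, score, max_len - new_length, order[pos + 1:])
            if bound >= state["tau"]:       # cases 1 and 3: search the descendants
                find_oracle(pos + 1, v, new_system, new_length)
            # otherwise case 2: the subtree below v is pruned

    find_oracle(0, [], Counter(), 0)
    total = sum(sum(ref.values()) for ref in refs)
    return Fraction(state["tau"], total), state["oracles"]


references = [["a", "b", "c", "d"]]
sentences = [["a", "b"], ["c", "d"], ["a", "c"], ["b", "d"]]
score, oracles = enumerate_oracles(references, sentences, n=1, max_len=4)
print(score, oracles)
```

Pruning uses a strict comparison (bound below $\tau$), so subtrees that could still tie the best score are explored; that is what allows all oracle summaries to be collected rather than just one.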

6 Experiments

6.1 Experimental Setting

We conducted experiments on the corpora developed for a multiple document summarization task in DUC 2001 to 2007. Table 1 shows the statistics of the data. In particular, the DUC-2005 to -2007 data sets not only have very large numbers of sentences and words but also a long target length (the reference summary length) of 250 words.

All the words in the documents were stemmed by Porter’s stemmer [Porter, 1980]. We computed $\text{\sc Rouge}_{1}$ scores, excluding stopwords, and computed $\text{\sc Rouge}_{2}$ scores, keeping them. [?] suggested using $\text{\sc Rouge}_{1}$ and keeping stopwords. However, as Takamura et al. argued [Takamura and Okumura, 2009], the summaries optimized with non-content words failed to consider the actual quality. Thus, we excluded stopwords for computing the $\text{\sc Rouge}_{1}$ scores.

We enumerated the following two types of oracle summaries: those for a set of references for a given topic and those for each reference in the set of references.

6.2 Results and Discussion

6.2.1 Impact of Oracle $\text{\sc Rouge}_{n}$ scores

Table 2 shows the average $\text{\sc Rouge}_{1,2}$ scores of the oracle summaries obtained from both a set of references and each reference in the set (“multi” and “single”), those of the best conventional system (Peer), and those obtained from summaries produced by a greedy algorithm (Algorithm 2).

Oracle (single) obtained better $\text{\sc Rouge}_{1,2}$ scores than Oracle (multi). The results imply that it is easier to optimize for a single reference summary than for a set of reference summaries. On the other hand, the $\text{\sc Rouge}_{1,2}$ scores of these oracle summaries are significantly higher than those of the best systems. Relative to the oracle summaries, the best systems reached only 60% to 70% of the $\text{\sc Rouge}_{1}$ scores in “multi” and 50% to 60% in “single”, and only 40% to 55% of the $\text{\sc Rouge}_{2}$ scores in “multi” and 30% to 40% in “single”.

Since the systems in Table 2 were developed over many years, we compared the $\text{\sc Rouge}_{n}$ scores of the oracle summaries with those of the current state-of-the-art systems using the DUC-2004 corpus and obtained summaries generated by different systems from a public repository (http://www.cis.upenn.edu/~nlp/corpora/sumrepo.html) [Hong et al., 2014]. The repository includes summaries produced by the following seven state-of-the-art summarization systems: CLASSY04 [Conroy et al., 2004], CLASSY11 [Conroy et al., 2011], Submodular [Lin and Bilmes, 2012], DPP [Kulesza and Tasker, 2011], RegSum [Hong and Nenkova, 2014], OCCAMS_V [Davie et al., 2012, Conroy et al., 2013], and ICSISumm [Gillick and Favre, 2009, Gillick et al., 2009]. Table 3 shows the results.

Based on the results, RegSum [Hong and Nenkova, 2014] achieved the best $\text{\sc Rouge}_{1}{=}0.331$ result, while ICSISumm [Gillick and Favre, 2009, Gillick et al., 2009] (a compressive summarizer) achieved the best result with $\text{\sc Rouge}_{2}{=}0.098$. These systems outperformed the best systems (Peers 65 and 67 in Table 2), but the differences in the $\text{\sc Rouge}_{n}$ scores between the systems and the oracle summaries are still large. More recently, [?] demonstrated that their system combination approach achieved the current best $\text{\sc Rouge}_{2}$ score, 0.105, for the DUC-2004 corpus. However, a large difference remains between the $\text{\sc Rouge}_{2}$ scores of the oracle summaries and their summaries.

In short, the $\text{\sc Rouge}_{n}$ scores of the oracle summaries are significantly higher than those of the current state-of-the-art summarization systems, both extractive and compressive summarization. These results imply that further improvement of the performance of extractive summarization is possible.

On the other hand, the $\text{\sc Rouge}_{n}$ scores of the oracle summaries are far from $\text{\sc Rouge}_{n}=1$ . We believe that the results are related to the summary’s compression rate. The data set’s compression rate was only 1 to 2%. Thus, under tight length constraints, extractive summarization basically fails to cover large numbers of n-grams in the reference summary. This reveals the limitation of the extractive summarization paradigm and suggests that we need another direction, compressive or abstractive summarization, to overcome the limitation.

6.2.2 $\text{\sc Rouge}$ Scores of Summaries Obtained from Greedy Algorithm

Table 2 also shows the $\text{\sc Rouge}_{1,2}$ scores of the summaries obtained from the greedy algorithm (greedy summaries). Although there are statistically significant differences between the $\text{\sc Rouge}$ scores of the oracle summaries and the greedy summaries, the greedy summaries achieved near-optimal scores, i.e., their approximation ratios are close to 0.9. These results are surprising since the algorithm’s theoretical lower bound is $\frac{1}{2}(1-\frac{1}{e})\,\text{OPT}\ (\simeq 0.32\,\text{OPT})$.

On the other hand, the near-optimal scores do not imply that the two kinds of summaries are similar at the sentence level. Table 4 shows the average Jaccard Index between the oracle summaries and the corresponding greedy summaries for the DUC-2004 corpus. The results demonstrate that, at the sentence level, the oracle summaries are far less similar to the greedy summaries than their $\text{\sc Rouge}_{n}$ scores suggest. Thus, it might not be appropriate to use greedy summaries as training data for learning-based extractive summarization systems.
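As a reminder of what the sentence-level comparison measures, the Jaccard Index between two summaries viewed as sets of sentences can be sketched as follows. The function name `jaccard_index` and the sentence IDs are hypothetical and are not taken from the DUC data.

```python
def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| between two sets of sentence IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty summaries match
    return len(a & b) / len(a | b)


# Hypothetical sentence IDs for one oracle summary and one greedy summary.
oracle = {"s1", "s2", "s5", "s6"}
greedy = {"s1", "s2", "s3"}
print(jaccard_index(oracle, greedy))  # 2 shared sentences / 5 distinct sentences = 0.4
```

Two summaries can therefore cover almost the same n-grams while sharing only a minority of their sentences.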

6.2.3 Impact of Enumeration

Table 5 shows the median number of oracle summaries and the rates of the reference summaries that have multiple oracle summaries for each data set. Over 80% of the reference summaries and about 60% to 90% of the topics have multiple oracle summaries. Since the $\text{\sc Rouge}_{n}$ scores are based on the unweighted counting of n-grams, when many sentences have similar meanings, i.e., when there are many redundant sentences, the number of oracle summaries that have the same $\text{\sc Rouge}_{n}$ score increases. The source documents of multi-document summarization tasks are prone to contain many such redundant sentences, so the number of oracle summaries is large.
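To make the effect of redundancy concrete, the following minimal Python sketch enumerates oracle summaries by exhaustive search on a toy input. It is an illustration only and not the branch and bound algorithm of this paper: the function names (`ngrams`, `rouge_n`, `enumerate_oracles`), the toy sentences, and the 7-word limit are assumptions made for the example, and only a single reference summary is handled.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, summary_sentences, n):
    """ROUGE-n recall of a set of sentences against one reference (clipped counts)."""
    ref = ngrams(reference, n)
    hyp = Counter()
    for sent in summary_sentences:
        hyp += ngrams(sent, n)  # n-grams are counted within each sentence
    total = sum(ref.values())
    matched = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0


def enumerate_oracles(reference, sentences, max_words, n=1):
    """Exhaustively return (best score, all sentence-index sets achieving it)."""
    best, oracles = -1.0, []
    for k in range(1, len(sentences) + 1):
        for subset in combinations(range(len(sentences)), k):
            if sum(len(sentences[i]) for i in subset) > max_words:
                continue  # violates the length constraint
            score = rouge_n(reference, [sentences[i] for i in subset], n)
            if score > best + 1e-12:
                best, oracles = score, [set(subset)]
            elif abs(score - best) <= 1e-12:
                oracles.append(set(subset))
    return best, oracles


# Toy input: sentences 0 and 1 are redundant with respect to the reference.
reference = "the cat sat on the mat".split()
sentences = [
    "the cat sat down".split(),
    "the cat sat there".split(),
    "on the mat".split(),
]
best, oracles = enumerate_oracles(reference, sentences, max_words=7, n=1)
print(best, oracles)  # 1.0 [{0, 2}, {1, 2}]
```

Because the two redundant sentences contribute the same reference unigrams, swapping one for the other leaves the score unchanged, and both subsets are oracle summaries.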

The oracle summaries offer a significant benefit with respect to evaluating the extracted sentences. Since both the oracle and system summaries are sets of sentences, it is easy to check whether each sentence in the system summary is contained in one of the oracle summaries. Thus, we can exploit the F-measures, which are useful for evaluating classification-based extractive summarization [Mani and Bloedorn, 1998, Osborne, 2002, Hirao et al., 2002]. Here, we have to consider that the oracle summaries, obtained from a reference summary or a set of reference summaries, are not identical at the sentence level (e.g., the average Jaccard Index between the oracle summaries for the DUC-2004 corpus is around 0.5). The F-measures therefore vary depending on which oracle summary is used in the computation. For example, assume that we have system summary $S{=}\{s_{1},s_{2},s_{3},s_{4}\}$ and oracle summaries $O_{1}{=}\{s_{1},s_{2},s_{5},s_{6}\}$ and $O_{2}{=}\{s_{1},s_{2},s_{3}\}$ . The precision for $O_{1}$ is 0.5, while that for $O_{2}$ is 0.75; the recall for $O_{1}$ is 0.5, while that for $O_{2}$ is 1; the F-measure for $O_{1}$ is 0.5, while that for $O_{2}$ is 0.86.

Thus, we employ the scores gained by averaging all of the oracle summaries as evaluation measures. Precision, recall, and F-measure are defined as follows: $P{=}\{\sum_{O\in O_{\rm all}}|O\cap S|/|S|\}/|O_{\rm all}|$ , $R{=}\{\sum_{O\in O_{\rm all}}|O\cap S|/|O|\}/|O_{\rm all}|$ , $\text{F-measure}{=}2PR/(P+R)$ .
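The averaged measures can be sketched directly from these definitions. The snippet below applies them to the system summary $S$ and the oracle summaries $O_{1}$ and $O_{2}$ of the example above; the function name `averaged_prf` is hypothetical, and the sketch assumes that the system summary and every oracle summary are non-empty.

```python
def averaged_prf(system, oracles):
    """Average precision and recall over all oracle summaries, then combine into F."""
    system = set(system)
    oracles = [set(o) for o in oracles]
    precision = sum(len(o & system) / len(system) for o in oracles) / len(oracles)
    recall = sum(len(o & system) / len(o) for o in oracles) / len(oracles)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


S = {"s1", "s2", "s3", "s4"}
O1 = {"s1", "s2", "s5", "s6"}
O2 = {"s1", "s2", "s3"}
p, r, f = averaged_prf(S, [O1, O2])
print(p, r, round(f, 3))  # 0.625 0.75 0.682
```

Note that the F-measure is computed from the averaged precision and recall, so it is not the mean of the per-oracle F-measures (0.5 and 0.86 in the example).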

To demonstrate the F-measure’s effectiveness, we investigated the correlation between the F-measure and human judgment based on the evaluation results obtained from the DUC-2004 corpus. The results include summaries generated by 17 systems, each of which has a mean coverage score assigned by a human subject. We computed the correlation coefficients between the average F-measure and the average mean coverage score for 50 topics. Table 6 shows Pearson’s $r$ and Spearman’s $\rho$ . In the table, “F-measure (R1)” and “F-measure (R2)” indicate the F-measures calculated using oracle summaries optimized to $\text{\sc Rouge}_{1}$ and $\text{\sc Rouge}_{2}$ , respectively. “M” indicates the F-measure calculated using multiple oracle summaries, and “S” indicates the F-measure calculated using randomly selected oracle summaries. “multi” indicates oracle summaries obtained from a set of references, and “single” indicates oracle summaries obtained from a reference summary in the set. For “S,” we repeated the random selection of a single oracle summary and the F-measure calculation 100 times and report the average value with the 95% confidence interval of the F-measures obtained by bootstrap resampling.
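For readers who want to reproduce this kind of analysis, the two correlation coefficients can be sketched in plain Python as below. The per-system scores are made-up placeholders, not values from Table 6, and the helper names (`pearson_r`, `average_ranks`, `spearman_rho`) are hypothetical. Spearman’s $\rho$ is computed here as Pearson’s $r$ on average ranks, which is the usual definition in the presence of ties.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical per-system averages (one entry per system), for illustration only.
f_measure = [0.10, 0.20, 0.30, 0.40, 0.50]
coverage = [0.15, 0.10, 0.30, 0.35, 0.45]
print(pearson_r(f_measure, coverage), spearman_rho(f_measure, coverage))
```

In this toy input only the two lowest-scoring systems swap places, so Spearman’s $\rho$ is approximately 0.9.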

The results demonstrate that the F-measures are strongly correlated with human judgment. Their values are comparable with those of $\text{\sc Rouge}_{1,2}$ . In particular, F-measure (R1) (single-M) achieved the best Spearman’s $\rho$ result. When comparing “single” with “multi,” Pearson’s $r$ of “multi” was slightly lower than that of “single,” and Spearman’s $\rho$ of “multi” was almost the same as that of “single.” “M” has significantly better performance than “S.” These results imply that F-measures based on oracle summaries are a good evaluation measure and that oracle summaries have the potential to be an alternative to human-made reference summaries in terms of automatic evaluation. Moreover, the enumeration of the oracle summaries for a given reference summary or a set of reference summaries is essential for automatic evaluation.

6.2.4 Search Efficiency

To demonstrate the efficiency of our search algorithm against the naive exhaustive search method, we compared the number of feasible solutions (sets of sentences that satisfy the length constraint) with the number of summaries that were checked in our search algorithm. The algorithm that counts the number of feasible solutions is shown in Appendix B.

Table 7 shows the median number of feasible solutions and checked summaries yielded by our method for each data set (in the case of “single”). The differences in the number of feasible solutions between $\text{\sc Rouge}_{1}$ and $\text{\sc Rouge}_{2}$ are very large because the input set ( $|D|$ ) for $\text{\sc Rouge}_{1}$ is much larger than that for $\text{\sc Rouge}_{2}$ . On the other hand, the differences between $\text{\sc Rouge}_{1}$ and $\text{\sc Rouge}_{2}$ in the number of summaries checked by our method are only of the order of $10$ to $10^{2}$ . Compared with the naive exhaustive search, the search space of our method is significantly smaller. The differences are of the order of $10^{7}$ to $10^{30}$ with $\text{\sc Rouge}_{1}$ and $10^{4}$ to $10^{17}$ with $\text{\sc Rouge}_{2}$ . These results demonstrate the efficiency of our branch and bound technique.

In addition, we show an example of the processing time for extracting one oracle summary and enumerating all of the oracle summaries for the reference summaries in the DUC-2004 corpus on a Linux machine (CPU: Intel® Xeon® X5675, 3.07 GHz) with 192 GB of RAM. We utilized CPLEX 12.1 to solve the ILP problem. Our algorithm was implemented in C++ and compiled with GCC version 4.4.7. The results show that we needed 0.026 and 0.021 sec. to extract one oracle summary per reference summary and 0.047 and 0.031 sec. to extract one oracle summary per set of reference summaries for $\text{\sc Rouge}_{1}$ and $\text{\sc Rouge}_{2}$ , respectively. We needed 11.90 and 1.40 sec. to enumerate the oracle summaries per reference summary and 102.94 and 3.65 sec. per set of reference summaries for $\text{\sc Rouge}_{1}$ and $\text{\sc Rouge}_{2}$ , respectively. The extraction of one oracle summary for a reference summary can be achieved with the ILP solver in practical time, and the enumeration of oracle summaries is also efficient. However, to enumerate oracle summaries, we needed several weeks for some topics in DUCs 2005 to 2007 since they hold a huge number of source sentences.

7 Conclusions

To analyze the limitations and the future direction of extractive summarization, this paper proposed (1) an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of $\text{\sc Rouge}_{n}$ scores and (2) an algorithm that enumerates all oracle summaries to exploit F-measures that evaluate the sentences extracted by systems.

The evaluation results obtained from the corpora of DUCs 2001 to 2007 identified the following: (1) room still exists to improve the $\text{\sc Rouge}_{n}$ scores of extractive summarization systems even though the $\text{\sc Rouge}_{n}$ scores of the oracle summaries fell below the theoretical upper bound $\text{\sc Rouge}_{n}{=}1$ ; (2) over 80% of the reference summaries and from 60% to 90% of the sets of reference summaries have multiple oracle summaries, and the F-measures computed by utilizing the enumerated oracle summaries showed stronger correlation with human judgment than those computed from single oracle summaries.

Appendix A.

Proof.

We can rewrite the right side of equation (9) as follows:

[TABLE]

Here, $f(t_{n},{\cal R}_{k},{\cal V,W})$ is defined as follows:

[TABLE]

$N(t_{n},{\cal R}_{k}\setminus{\cal V})$ is the number of times $t_{n}$ occurs in the multiset ${\cal R}_{k}\setminus{\cal V}$ . Equation (14) is rewritten as

[TABLE]

The solutions of equation (15) are obtained by considering the following three conditions:

1. If $N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V})>0$ and $N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V})>N(t_{n},{\cal W})$ , then $f(t_{n},{\cal R}_{k},{\cal V,W})=N(t_{n},{\cal V})+N(t_{n},{\cal W})$

2. If $N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V})>0$ and $N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V})<N(t_{n},{\cal W})$ , then $f(t_{n},{\cal R}_{k},{\cal V,W})=N(t_{n},{\cal R}_{k})$

3. If $N(t_{n},{\cal R}_{k})-N(t_{n},{\cal V})<0$ , then $f(t_{n},{\cal R}_{k},{\cal V,W})=N(t_{n},{\cal R}_{k})$

From the above relations,

[TABLE]

Thus,

[TABLE]

∎
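The case analysis in the proof can be checked numerically on small counts. The sketch below rests on an assumption about the form of $f$: it takes $f$ to be the count of $t_{n}$ in ${\cal V}$ clipped by its count in ${\cal R}_{k}$, plus the count in ${\cal W}$ clipped by what remains in ${\cal R}_{k}\setminus{\cal V}$, with the multiset difference never going below zero. Under that reading, the three conditions collapse to a single clipped count over ${\cal V}$ and ${\cal W}$ together. The names `f_two_step` and `f_union` are hypothetical, and `r`, `v`, `w` stand for $N(t_{n},{\cal R}_{k})$, $N(t_{n},{\cal V})$, and $N(t_{n},{\cal W})$.

```python
def f_two_step(r, v, w):
    """Clipped count from V plus clipped count from W against what V left over."""
    remaining = max(r - v, 0)  # assumed reading of N(t, R_k \ V) as a multiset difference
    return min(r, v) + min(remaining, w)


def f_union(r, v, w):
    """Clipped count of V and W taken together against R_k."""
    return min(r, v + w)


for r in range(6):
    for v in range(6):
        for w in range(6):
            # The two-step count always equals the count for the combined set.
            assert f_two_step(r, v, w) == f_union(r, v, w)
            if r - v > 0 and r - v > w:
                assert f_two_step(r, v, w) == v + w  # condition 1
            elif r - v > 0 and r - v < w:
                assert f_two_step(r, v, w) == r      # condition 2
            elif r - v < 0:
                assert f_two_step(r, v, w) == r      # condition 3
print("all cases agree")
```

The boundary cases with equality, which the three conditions do not list explicitly, are also covered by the first assertion.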

Appendix B.

We propose an algorithm to compute the number of feasible solutions under the length constraint by extending the dynamic programming based approach for the subset sum problem [Cormen et al., 2009]. We define $C[i][j](0\leq i\leq|D|,0\leq j\leq L_{\rm max})$ , which stores the number of feasible solutions (length is less than $j$ ) that can be obtained from set $\{s_{1},\ldots,s_{i}\}$ as follows:

•

Initialization

[TABLE]

•

Recurrence ( $1\leq i\leq|D|$ )

[TABLE]

Algorithm 4 is a dynamic program that fills out the $(|D|+1)\times(L_{\rm max}+1)$ table. In the algorithm, we first pick up the sentences that contain at least one n-gram that appears in the reference summary and then recursively count the number of feasible solutions over them. After the table is filled, the cells on the $(|D|+1)$-th row store the counts, and the sum of the cells in that row whose column index $j$ ranges from 1 to $L_{\rm max}$ gives the number of feasible solutions. The time complexity of the algorithm is $O(nL_{\rm max})$ .
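A minimal Python sketch of such a counting dynamic program is given below. It is not a transcription of Algorithm 4: it assumes that each cell counts the subsets whose total length is exactly $j$ (one reading that is consistent with summing the last row), that sentence lengths are positive integers, and that the sentences have already been filtered as described. The function name `count_feasible` is hypothetical.

```python
def count_feasible(lengths, max_len):
    """Count non-empty subsets of sentences whose total length is at most max_len."""
    # c[i][j]: number of subsets of the first i sentences with total length exactly j
    c = [[0] * (max_len + 1) for _ in range(len(lengths) + 1)]
    c[0][0] = 1  # the empty subset
    for i, length in enumerate(lengths, start=1):
        for j in range(max_len + 1):
            c[i][j] = c[i - 1][j]  # subsets that skip sentence i
            if j >= length:
                c[i][j] += c[i - 1][j - length]  # subsets that take sentence i
    # Columns 1..max_len of the last row exclude only the empty subset.
    return sum(c[len(lengths)][1:])


# Three sentences of 4, 4, and 3 words under a 7-word limit.
print(count_feasible([4, 4, 3], 7))  # 5
```

The table has $(|D|+1)\times(L_{\rm max}+1)$ cells and each is filled in constant time, which matches the stated order of the algorithm when $n$ is read as the number of candidate sentences.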

Bibliography

[Almeida and Martins, 2013] Miguel B. Almeida and André F.T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics, pages 196–206.
[Banerjee et al., 2015] Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-document abstractive summarization using ILP based multi-sentence compression. In Proc. of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 1208–1214.
[Bing et al., 2015] Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1587–1597.
[Ceylan et al., 2010] Hakan Ceylan, Rada Mihalcea, Umut Özertem, Elena Lloret, and Manuel Palomar. 2010. Quantifying the limits and success of extractive summarization systems across domains. In Proc. of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 903–911.
[Conroy et al., 2004] John M. Conroy, Jade Goldstein, Judith D. Schlesinger, and Dianne P. O’Leary. 2004. Left-brain/right-brain multi-document summarization. In Proc. of the Document Understanding Conference (DUC).
[Conroy et al., 2011] John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O’Leary. 2011. Classy 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proc. of the Text Analysis Conference (TAC).
[Conroy et al., 2013] John M. Conroy, Sashka T. Davis, Jeff Kubina, Yi-Kai Liu, Dianne P. O’Leary, and Judith D. Schlesinger. 2013. Multilingual summarization: Dimensionality reduction and a step towards optimal term coverage. In Proc. of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization, pages 55–63.
[Cormen et al., 2009] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms. The MIT Press, 3rd edition.
