Enumeration of Extractive Oracle Summaries
Tsutomu Hirao, Masaaki Nishino, Jun Suzuki, Masaaki Nagata

TL;DR
This paper introduces an ILP-based method to enumerate all extractive oracle summaries, revealing potential for improving summarization performance and better aligning automatic metrics with human judgment.
Contribution
It presents a novel ILP formulation and enumeration algorithm for extractive oracle summaries, enhancing analysis of summarization quality and evaluation.
Findings
Enumerated oracle summaries correlate better with human judgment.
Room for improvement in extractive summarization performance.
F-measures from enumeration outperform single oracle summaries.
Abstract
To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.
| Year | Topics | Docs. | Sents. | Words | Refs. | Length |
|---|---|---|---|---|---|---|
| 01 | 30 | 10 | 365 | 7706 | 89 | 100 |
| 02 | 59 | 10 | 238 | 4822 | 116 | 100 |
| 03 | 30 | 10 | 245 | 5711 | 120 | 100 |
| 04 | 50 | 10 | 218 | 4870 | 200 | 100 |
| 05 | 50 | 29.5 | 885 | 18273.5 | 300 | 250 |
| 06 | 50 | 25 | 732.5 | 15997.5 | 200 | 250 |
| 07 | 45 | 25 | 516 | 11427 | 180 | 250 |
| | 01 R1 | 01 R2 | 02 R1 | 02 R2 | 03 R1 | 03 R2 | 04 R1 | 04 R2 | 05 R1 | 05 R2 | 06 R1 | 06 R2 | 07 R1 | 07 R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Oracle (multi) | .400 | .164 | .452 | .186 | .434 | .185 | .427 | .162 | .445 | .177 | .491 | .211 | .506 | .236 |
| Oracle (single) | .500 | .226 | .515 | .225 | .525 | .258 | .519 | .228 | .574 | .279 | .607 | .303 | .622 | .330 |
| Greedy | .387 | .161 | .438 | .184 | .424 | .182 | .412 | .157 | .430 | .173 | .473 | .206 | .495 | .234 |
| Peer | .251 | .080 | .269 | .080 | .295 | .094 | .305 | .092 | .262 | .073 | .305 | .095 | .363 | .117 |
| Peer ID | T | T | 19 | 19 | 26 | 13 | 67 | 65 | 10 | 15 | 23 | 24 | 29 | 15 |
| System | Rouge1 | Rouge2 |
|---|---|---|
| Oracle (multi) | .427 | .162 |
| Oracle (single) | .519 | .228 |
| CLASSY04 | .305 | .0897 |
| CLASSY11 | .286 | .0919 |
| Submodular | .300 | .0933 |
| DPP | .309 | .0960 |
| RegSum | .331 | .0974 |
| OCCAMS_V | .300 | .0974 |
| ICSISumm | .310 | .0980 |
| | single | multi |
|---|---|---|
| Rouge1 | .451 | .419 |
| Rouge2 | .536 | .530 |
| Year | Median, single, Rouge1 | Median, single, Rouge2 | Median, multi, Rouge1 | Median, multi, Rouge2 | Rate, single, Rouge1 | Rate, single, Rouge2 | Rate, multi, Rouge1 | Rate, multi, Rouge2 |
|---|---|---|---|---|---|---|---|---|
| 01 | 8 | 9 | 4 | 5 | .854 | .787 | .833 | .733 |
| 02 | 7.5 | 5.5 | 4 | 4 | .897 | .836 | .814 | .780 |
| 03 | 8 | 10.5 | 3.5 | 4 | .833 | .858 | .800 | .900 |
| 04 | 8 | 8 | 3.5 | 3 | .865 | .865 | .780 | .760 |
| 05 | 35 | 35.5 | 2 | 3 | .916 | .907 | .580 | .660 |
| 06 | 28 | 22 | 2.5 | 3 | .877 | .880 | .700 | .720 |
| 07 | 23 | 16 | 4 | 2 | .910 | .878 | .733 | .711 |
| Metric | Pearson's r | Spearman's ρ |
|---|---|---|
| Rouge1 | .861 | .760 |
| Rouge2 | .907 | .831 |
| F-measure (R1) (single-M) | .857 | .855 |
| F-measure (R1) (single-S) | .815-.830 | .811-.830 |
| F-measure (R2) (single-M) | .904 | .826 |
| F-measure (R2) (single-S) | .855-.865 | .740-.760 |
| F-measure (R1) (multi-M) | .814 | .841 |
| F-measure (R1) (multi-S) | .794-.802 | .803-.813 |
| F-measure (R2) (multi-M) | .824 | .846 |
| F-measure (R2) (multi-S) | .806-.816 | .797-.817 |
| Year | Rouge1, Naive | Rouge1, Proposed | Rouge2, Naive | Rouge2, Proposed |
|---|---|---|---|---|
| 01 | 3.66 | 5.75 | 3.32 | 1.00 |
| 02 | 1.12 | 4.64 | 1.34 | 8.87 |
| 03 | 1.62 | 3.65 | 6.37 | 8.19 |
| 04 | 9.65 | 4.47 | 6.90 | 9.83 |
| 05 | 5.48 | 2.32 | 3.48 | 7.03 |
| 06 | 1.94 | 1.97 | 2.11 | 5.08 |
| 07 | 4.14 | 1.40 | 1.81 | 2.60 |
Enumeration of Extractive Oracle Summaries
Tsutomu Hirao
Masaaki Nishino
Jun Suzuki
Masaaki Nagata
NTT Communication Science Laboratories, NTT Corporation
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0237, Japan
{hirao.tsutomu,nishino.masaaki}@lab.ntt.co.jp
{suzuki.jun,nagata.masaaki}@lab.ntt.co.jp
Abstract
To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.
1 Introduction
Recently, compressive and abstractive summarization are attracting attention (e.g., ?), ?), ?), ?), ?)). However, extractive summarization remains a primary research topic because the linguistic quality of the resultant summaries is guaranteed, at least at the sentence level, which is a key requirement for practical use (e.g., ?), ?), ?), ?)).
The summarization research community is experiencing a paradigm shift from extractive to compressive or abstractive summarization. Currently our question is: “Is extractive summarization still useful research?” To answer it, the ultimate limitations of the extractive summarization paradigm must be comprehended; that is, we have to determine its upper bound and compare it with the performance of the state-of-the-art summarization methods. Since $\mathrm{ROUGE}_n$ is the de-facto automatic evaluation method and is employed in many text summarization studies, an oracle summary is defined as a set of sentences that has a maximum $\mathrm{ROUGE}_n$ score. If the $\mathrm{ROUGE}_n$ score of an oracle summary exceeds that of a system that employs another summarization approach, the extractive summarization paradigm still merits research resources.
As another benefit, identifying an oracle summary for a set of reference summaries allows us to utilize yet another evaluation measure. Since both oracle and extractive summaries are sets of sentences, it is easy to check whether a system summary contains sentences in the oracle summary. As a result, F-measures, which are available to evaluate a system summary, are useful for evaluating classification-based extractive summarization [Mani and Bloedorn, 1998, Osborne, 2002, Hirao et al., 2002]. Since $\mathrm{ROUGE}_n$ evaluation does not identify which sentence is important, an F-measure conveys useful information in terms of “important sentence extraction.” Thus, combining $\mathrm{ROUGE}_n$ and an F-measure allows a closer failure analysis of systems.
Note that more than one oracle summary might exist for a set of reference summaries because $\mathrm{ROUGE}_n$ scores are based on the unweighted counting of n-grams. As a result, an F-measure might not be identical among multiple oracle summaries. Thus, we need to enumerate the oracle summaries for a set of reference summaries and compute the F-measures based on them.
In this paper, we first derive an Integer Linear Programming (ILP) problem to extract an oracle summary from a set of reference summaries and a source document(s). To the best of our knowledge, this is the first ILP formulation that extracts oracle summaries. Second, since it is difficult to enumerate oracle summaries for a set of reference summaries using ILP solvers, we propose an algorithm that efficiently enumerates all oracle summaries by exploiting the branch and bound technique. Our experimental results on the Document Understanding Conference (DUC) corpora showed the following:
1. Room still exists for the further improvement of extractive summarization, i.e., the $\mathrm{ROUGE}_n$ scores of the oracle summaries are significantly higher than those of the state-of-the-art summarization systems.
2. The F-measures derived from multiple oracle summaries obtain significantly stronger correlations with human judgment than those derived from single oracle summaries.
2 Definition of Extractive Oracle Summaries
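We first briefly describe $\mathrm{ROUGE}_n$. Given set of reference summaries $\mathbf{R}$ and system summary $S$, $\mathrm{ROUGE}_n$ is defined as follows:
$$\mathrm{ROUGE}_n(\mathbf{R}, S) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} \min\{N(g_j^n, \mathcal{R}_k),\, N(g_j^n, \mathcal{S})\}}{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} N(g_j^n, \mathcal{R}_k)} \tag{1}$$
$\mathcal{R}_k$ denotes the multiple set of n-grams that occur in the $k$-th reference summary $R_k$, and $\mathcal{S}$ denotes the multiple set of n-grams that appear in system-generated summary $S$ (a set of sentences). $N(g_j^n, \mathcal{R}_k)$ and $N(g_j^n, \mathcal{S})$ return the number of occurrences of n-gram $g_j^n$ in the $k$-th reference and system summaries, respectively. Function $U(\cdot)$ transforms a multiple set into a normal set. $\mathrm{ROUGE}_n$ takes values in the range of $[0, 1]$, and when the n-gram occurrences of the system summary agree with those of the reference summary, the value is 1.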
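The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `rouge_n` are hypothetical, each reference is assumed to be a single flat token list, and the system summary is assumed to be a list of tokenized sentences whose n-grams do not cross sentence boundaries.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset (Counter) of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(references, summary_sentences, n=1):
    """Recall-style ROUGE_n of a summary against a set of references.

    references: list of reference summaries, each a list of tokens (assumption).
    summary_sentences: list of sentences, each a list of tokens (assumption).
    """
    # Count n-grams sentence by sentence so none spans a sentence boundary.
    sys_counts = Counter()
    for sent in summary_sentences:
        sys_counts += ngrams(sent, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        # Clipped match: min of reference count and system count per n-gram.
        matched += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

refs = [["the", "cat", "sat"], ["a", "cat", "sat", "down"]]
summary = [["the", "cat"], ["sat", "down"]]
print(rouge_n(refs, summary, n=1))
print(rouge_n(refs, summary, n=2))
```

Because the counts are clipped per reference and summed before dividing, a summary can only reach a score of 1 if it covers every n-gram occurrence of every reference.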
In this paper, we focus on extractive summarization, employ $\mathrm{ROUGE}_n$ as an evaluation measure, and define the oracle summaries as follows:
$$O = \operatorname*{arg\,max}_{S \subseteq D} \mathrm{ROUGE}_n(\mathbf{R}, S) \quad \text{s.t.} \quad \ell(S) \le L_{\max} \tag{2}$$
$D$ is the set of all the sentences contained in the input document(s), and $L_{\max}$ is the length limitation of the oracle summary. $\ell(S)$ indicates the number of words in the system summary. Eq. (2) is an NP-hard combinatorial optimization problem, and no polynomial-time algorithm is known that attains an optimal solution.
3 Related Work
?) utilized a naive exhaustive search method to obtain oracle summaries in terms of $\mathrm{ROUGE}_n$ and exploited them to understand the limitations of extractive summarization systems. ?) proposed another naive exhaustive search method to derive a probability density function from the $\mathrm{ROUGE}_n$ scores of oracle summaries for the domains to which source documents belong. The computational complexity of naive exhaustive methods is exponential in the size of the sentence set. Thus, it may be possible to apply them to single document summarization tasks involving a dozen sentences, but it is infeasible to apply them to multiple document summarization tasks that involve several hundred sentences.
To describe the difference between the $\mathrm{ROUGE}_n$ scores of oracle and system summaries in multiple document summarization tasks, ?) proposed an approximate algorithm with a genetic algorithm (GA) to find oracle summaries. ?) utilized a greedy algorithm for the same purpose. Although GA or greedy algorithms are widely used to solve NP-hard combinatorial optimization problems, the solutions are not always optimal. Thus, the summary does not always have a maximum $\mathrm{ROUGE}_n$ score for the set of reference summaries. Both works called the summary found by their methods the oracle, but it differs from the definition in our paper.
Since summarization systems cannot reproduce human-made reference summaries in most cases, oracle summaries, which can be reproduced by summarization systems, have been used as training data to tune the parameters of summarization systems. For example, ?) and ?) trained their summarizers with oracle summaries found by a greedy algorithm. ?) proposed a method to find a summary that approximates a Rouge score based on the Rouge scores of individual sentences and exploited the framework to train their summarizer. As mentioned above, such summaries do not always agree with the oracle summaries defined in our paper. Thus, the quality of the training data is suspect. Moreover, since these studies fail to consider that a set of reference summaries has multiple oracle summaries, the score of the loss function defined between their oracle and system summaries is not appropriate in most cases.
As mentioned above, no known efficient algorithm can extract “exact” oracle summaries as defined in Eq. (2); only a naive exhaustive search is available. Thus, approximate algorithms such as a greedy algorithm are mainly employed to obtain them.
4 Oracle Summary Extraction as an Integer Linear Programming (ILP) Problem
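To extract an oracle summary from document(s) and a given set of reference summaries, we start by deriving an Integer Linear Programming (ILP) problem. Since the denominator of Eq. (1) is constant for a given set of reference summaries, we can find an oracle summary by maximizing the numerator of Eq. (1). Thus, the ILP formulation is defined as follows:
$$\begin{aligned}
\text{maximize}\quad & \sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} z_{kj} && (3)\\
\text{subject to}\quad & \sum_{i=1}^{|D|} \ell(s_i)\, x_i \le L_{\max} && (4)\\
& \forall j, k:\ \sum_{i=1}^{|D|} N(g_j^n, s_i)\, x_i \ge z_{kj} && (5)\\
& \forall j, k:\ N(g_j^n, \mathcal{R}_k) \ge z_{kj} && (6)\\
& \forall i:\ x_i \in \{0, 1\} && (7)\\
& \forall j, k:\ z_{kj} \in \mathbb{Z}_{+} && (8)
\end{aligned}$$
Here, $z_{kj}$ is the count of the $j$-th n-gram of the $k$-th reference summary in the oracle summary, i.e., $z_{kj} = \min\{N(g_j^n, \mathcal{R}_k),\, \sum_{i} N(g_j^n, s_i)\, x_i\}$. $\ell(\cdot)$ returns the number of words in the sentence, $x_i$ is a binary indicator, and $x_i = 1$ denotes that the $i$-th sentence $s_i$ is included in the oracle summary. $N(g_j^n, s_i)$ returns the number of occurrences of n-gram $g_j^n$ in the $i$-th sentence. Constraints (5) and (6) bound $z_{kj}$ from above by both counts, so maximizing the objective ensures that $z_{kj}$ equals their minimum.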
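To make the role of the two counting constraints concrete, the following sketch evaluates the ILP objective by brute force over all binary sentence-selection vectors instead of calling an ILP solver. It is only an illustration for toy inputs (the search is exponential); the names `ilp_objective` and `brute_force_oracle` and the toy data are hypothetical, and sentences and references are assumed to be given directly as n-gram count multisets.

```python
from collections import Counter
from itertools import product

def ilp_objective(x, sent_counts, ref_counts_list):
    """Sum of z_kj, with every z_kj set to its largest feasible value."""
    selected = Counter()
    for xi, counts in zip(x, sent_counts):
        if xi:
            selected += counts
    # z_kj is bounded by the reference count and by the selected count,
    # so its best value is the minimum of the two.
    return sum(min(c, selected[g]) for ref in ref_counts_list for g, c in ref.items())

def brute_force_oracle(sent_counts, sent_lengths, ref_counts_list, l_max):
    """Try every x in {0,1}^|D|; usable only on toy inputs."""
    best, best_x = -1, None
    for x in product((0, 1), repeat=len(sent_counts)):
        # Length constraint: total words of the selected sentences.
        if sum(l * xi for l, xi in zip(sent_lengths, x)) > l_max:
            continue
        value = ilp_objective(x, sent_counts, ref_counts_list)
        if value > best:
            best, best_x = value, x
    return best, best_x

# Hypothetical toy instance: three sentences, one reference, unigram counts.
sents = [Counter(a=1, b=1), Counter(b=1, c=1), Counter(a=1, c=1, d=1)]
lengths = [2, 2, 3]
refs = [Counter(a=1, b=1, c=1, d=1)]
print(brute_force_oracle(sents, lengths, refs, l_max=4))
```

In this toy instance two different selections reach the same objective value, which is the situation that motivates enumerating oracle summaries rather than extracting only one.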
5 Branch and Bound Technique for Enumerating Oracle Summaries
Since enumerating oracle summaries with an ILP solver is difficult, we extend the exhaustive search approach by introducing a search and prune technique to enumerate the oracle summaries. The search pruning decision is made by comparing the current upper bound of the $\mathrm{ROUGE}_n$ score with the maximum $\mathrm{ROUGE}_n$ score in the search history.
5.1 Score for Two Distinct Sets of Sentences
The enumeration of oracle summaries can be regarded as a depth-first search on a tree whose nodes represent sentences. Fig. 1 shows an example of a search tree created in a naive exhaustive search. The nodes represent sentences and the path from the root node to an arbitrary node represents a summary. For example, the red path in Fig. 1 from the root node to a node represents the summary consisting of the sentences on that path. By utilizing the tree, we can enumerate oracle summaries by exploiting depth-first searches while excluding the summaries that violate length constraints. However, this naive exhaustive search approach is impractical for large data sets because the number of nodes inside the tree is $2^{|D|}$, one per subset of the sentence set $D$.
If we prune the unwarranted subtrees in each step of the depth-first search, we can make the search more efficient. The decision to search or prune is made by comparing the current upper bound of the $\mathrm{ROUGE}_n$ score with the maximum $\mathrm{ROUGE}_n$ score in the search history. For instance, suppose that in Fig. 1 we reach a node by following a path from the root. If we estimate the maximum $\mathrm{ROUGE}_n$ score (upper bound) obtained by searching the descendants of that node (the subtree in the blue rectangle), we can decide whether the depth-first search should be continued. When the upper bound of the $\mathrm{ROUGE}_n$ score exceeds the current maximum in the search history, we have to continue. When the upper bound is smaller than the current maximum $\mathrm{ROUGE}_n$ score, no summary that contains the sentences on the current path is optimal, so we can skip subsequent search activity on the subtree and proceed to check the next branch.
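To estimate the upper bound of the $\mathrm{ROUGE}_n$ score, we re-define it for two distinct sets of sentences, $V$ and $W$, i.e., $V \cap W = \emptyset$, as follows:
$$\mathrm{ROUGE}_n(\mathbf{R}, V \cup W) = \mathrm{ROUGE}_n(\mathbf{R}, V) + \mathrm{ROUGE}'_n(\mathbf{R}, V, W) \tag{9}$$
Here $\mathrm{ROUGE}'_n$ is defined as follows:
$$\mathrm{ROUGE}'_n(\mathbf{R}, V, W) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} \min\{N(g_j^n, \mathcal{R}_k \setminus \mathcal{V}),\, N(g_j^n, \mathcal{W})\}}{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} N(g_j^n, \mathcal{R}_k)} \tag{10}$$
$\mathcal{V}, \mathcal{W}$ are the multiple sets of n-grams found in the sets of sentences $V$ and $W$, respectively, and $\mathcal{R}_k \setminus \mathcal{V}$ is the multiple-set difference.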
Theorem 1.
Eq. (9) is correct.
Proof.
See Appendix A. ∎
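The decomposition stated by the theorem can be checked numerically on a small case. The sketch below is illustrative only: it works directly on n-gram count multisets (Python `Counter` objects), the function names are hypothetical, and the toy counts are made up for the check.

```python
from collections import Counter

def clipped(ref, cand):
    """Clipped matches between one reference multiset and a candidate multiset."""
    return sum(min(c, cand[g]) for g, c in ref.items())

def rouge(refs, v):
    """ROUGE_n of the n-gram multiset v against the reference multisets."""
    return sum(clipped(r, v) for r in refs) / sum(sum(r.values()) for r in refs)

def rouge_prime(refs, v, w):
    """Extra score contributed by w once v has already been selected."""
    # r - v is the multiset difference; Counter subtraction drops counts <= 0.
    return sum(clipped(r - v, w) for r in refs) / sum(sum(r.values()) for r in refs)

refs = [Counter(a=2, b=1, c=1)]   # hypothetical reference n-gram counts
v = Counter(a=1, b=2)             # n-grams of the sentences already chosen
w = Counter(a=3, c=1)             # n-grams of a disjoint set of sentences

lhs = rouge(refs, v + w)          # v + w: n-grams of the union of both sentence sets
rhs = rouge(refs, v) + rouge_prime(refs, v, w)
print(lhs, rhs)
```

Because the two sentence sets are disjoint, the n-gram multiset of their union is the sum of the two multisets, which is why `v + w` is used on the left-hand side.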
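5.2 Upper Bound of $\mathrm{ROUGE}_n$
Let $V$ be the set of sentences on the path from the current node to the root node in the search tree, and let $W$ be the set of sentences that are the descendants of the current node. According to Theorem 1, the upper bound of the $\mathrm{ROUGE}_n$ score is defined as:
$$\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) = \mathrm{ROUGE}_n(\mathbf{R}, V) + \max_{\Omega \subseteq W} \left\{ \mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) : \ell(\Omega) \le L_{\max} - \ell(V) \right\} \tag{11}$$
where $\ell(\cdot)$ is the length in words and $L_{\max}$ is the length limitation. Since the second term on the right side in Eq. (11) is an NP-hard problem, we turn to the following relation by introducing the inequality $\mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) \le \sum_{\omega \in \Omega} \mathrm{ROUGE}'_n(\mathbf{R}, V, \{\omega\})$:
$$\max_{\Omega \subseteq W} \left\{ \mathrm{ROUGE}'_n(\mathbf{R}, V, \Omega) : \ell(\Omega) \le L_{\max} - \ell(V) \right\} \le \max_{\mathbf{x}} \left\{ \sum_{i=1}^{|W|} \mathrm{ROUGE}'_n(\mathbf{R}, V, \{w_i\})\, x_i : \sum_{i=1}^{|W|} \ell(w_i)\, x_i \le L_{\max} - \ell(V) \right\} \tag{12}$$
Here, $\mathbf{x} = (x_1, \ldots, x_{|W|})$ and $x_i \in \{0, 1\}$. The right side of Eq. (12) is a knapsack problem, i.e., a 0-1 ILP problem. Although we can obtain the optimal solution for it using dynamic programming or ILP solvers, we solve its linear programming relaxation version by applying a greedy algorithm for greater computation efficiency. The solution output by the greedy algorithm is optimal for the relaxed problem. Since the optimal value of the relaxed problem is never smaller than that of the original problem, the relaxed problem solution can be utilized as the upper bound.
Algorithm 1 shows the pseudocode that attains the upper bound of $\mathrm{ROUGE}_n$. In the algorithm, $U$ indicates the upper bound score of $\mathrm{ROUGE}_n$. We first set the initial score of upper bound $U$ to $\mathrm{ROUGE}_n(\mathbf{R}, V)$ (line 3). Then we compute the density of the $\mathrm{ROUGE}'_n$ scores ($\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w\}) / \ell(w)$) for each sentence $w$ in $W$ and sort them in descending order (lines 4 to 6). When we have room to add $w_i$ to the summary, we update $U$ by adding $\mathrm{ROUGE}'_n(\mathbf{R}, V, \{w_i\})$ (line 10) and update the remaining length $L$ (line 11). When we do not have room to add $w_i$, we update $U$ by adding the score obtained by multiplying the density of $w_i$ by the remaining length $L$ (line 13), and exit the while loop.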
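The greedy procedure for the relaxed knapsack problem described above can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 1: the function name `upper_bound` and its argument layout are assumptions, the per-sentence gains are taken as precomputed inputs, and every sentence length is assumed to be positive.

```python
def upper_bound(base_score, gains, lengths, remaining):
    """Greedy bound from the LP relaxation of the knapsack problem.

    base_score: score of the sentences already on the current path.
    gains[i]:   additional score of candidate sentence i taken on its own.
    lengths[i]: number of words in candidate sentence i (assumed > 0).
    remaining:  words still available under the length limit.
    """
    bound = base_score
    # Visit candidates in descending order of score density (gain per word).
    order = sorted(range(len(gains)), key=lambda i: gains[i] / lengths[i], reverse=True)
    for i in order:
        if lengths[i] <= remaining:
            # The whole sentence fits: add its gain and consume its length.
            bound += gains[i]
            remaining -= lengths[i]
        else:
            # No room: add the fractional part (density times remaining words) and stop.
            bound += gains[i] / lengths[i] * remaining
            break
    return bound

# Hypothetical numbers: base score 0.2, three candidates, 12 words left.
print(upper_bound(0.2, [0.3, 0.2, 0.1], [10, 5, 10], 12))
```

Stopping at the first sentence that does not fit is what makes this the optimum of the relaxed problem: the fractional item fills the remaining budget at the best available density.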
5.3 Initial Score for Search
Since the branch and bound technique prunes the search by comparing the best solution found so far with the upper bounds, obtaining a good solution in the early stage is critical for raising search efficiency.
Since $\mathrm{ROUGE}_n$ is a monotone submodular function [Lin and Bilmes, 2011], we can obtain a good approximate solution by a greedy algorithm [Khuller et al., 1999]. It is guaranteed that the score of the obtained approximate solution is at least $\frac{1}{2}(1 - \frac{1}{e})\,\mathrm{OPT}$, where OPT is the score of the optimal solution. We employ the solution as the initial $\mathrm{ROUGE}_n$ score of the candidate oracle summary.
Algorithm 2 shows the greedy algorithm. In it, $S$ denotes a summary and $D$ denotes a set of sentences. The algorithm iteratively adds sentence $s^*$ that yields the largest gain in the $\mathrm{ROUGE}_n$ score to current summary $S$, provided the length of the summary does not violate length constraint $L_{\max}$ (line 4). After the while loop, the algorithm compares the $\mathrm{ROUGE}_n$ score of $S$ with the maximum $\mathrm{ROUGE}_n$ score of the single sentence and outputs the larger of the two scores (lines 11 to 13).
5.4 Enumeration of Oracle summaries
By introducing threshold $\tau$ as the best score in the search history, pruning decisions involve the following three conditions:
1. $\mathrm{ROUGE}_n(\mathbf{R}, V) \ge \tau$;
2. $\mathrm{ROUGE}_n(\mathbf{R}, V) < \tau$, $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) < \tau$;
3. $\mathrm{ROUGE}_n(\mathbf{R}, V) < \tau$, $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V) \ge \tau$.
With case 1, we update the oracle summary as $V$ and continue the search. With case 2, because both $\mathrm{ROUGE}_n(\mathbf{R}, V)$ and $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, V)$ are smaller than $\tau$, the subtree whose root node is the current node (last visited node) is pruned from the search space, and we continue the depth-first search from the neighbor node. With case 3, we do not update oracle summary as $V$ because $\mathrm{ROUGE}_n(\mathbf{R}, V)$ is less than $\tau$. However, we might obtain a better oracle summary by continuing the depth-first search because the upper bound of the $\mathrm{ROUGE}_n$ score is not below $\tau$. Thus, we continue to search for the descendants of the current node.
Algorithm 3 shows the pseudocode that enumerates the oracle summaries. The algorithm reads a set of reference summaries $\mathbf{R}$, length limitation $L_{\max}$, and set of sentences $D$ (line 1) and initializes threshold $\tau$ as the $\mathrm{ROUGE}_n$ score obtained by the greedy algorithm (Algorithm 2). It also initializes $O_\tau$, which stores oracle summaries whose $\mathrm{ROUGE}_n$ scores are $\tau$, and priority queue $C$, which stores the history of the depth-first search (line 2). Next, the algorithm computes the $\mathrm{ROUGE}_n$ score for each sentence and stores the sentences in priority queue $Q$ after sorting them in descending order. After that, we start a depth-first search by recursively calling procedure FindOracle. In the procedure, we extract the top sentence from priority queue $Q$ and append it to priority queue $C$ (lines 11 to 12). When the length of $C$ is less than $L_{\max}$, if $\mathrm{ROUGE}_n(\mathbf{R}, C)$ is not smaller than threshold $\tau$ (case 1), we update $\tau$ as the score and append current $C$ to $O_\tau$. Then we continue the depth-first search by calling the procedure FindOracle (lines 15 to 17). If $\widehat{\mathrm{ROUGE}}_n(\mathbf{R}, C)$ is not smaller than $\tau$ (case 3), we do not update $\tau$ and $O_\tau$ but reenter the depth-first search by calling the procedure again (lines 18 to 19). If neither case 1 nor case 3 is true, we delete the last visited sentence from $C$ and return to the top of the recurrence.
6 Experiments
6.1 Experimental Setting
We conducted experiments on the corpora developed for a multiple document summarization task in DUC 2001 to 2007. Table 1 shows the statistics of the data. In particular, the DUC-2005 to -2007 data sets not only have very large numbers of sentences and words but also a long target length (the reference summary length) of 250 words.
All the words in the documents were stemmed by Porter’s stemmer [Porter, 1980]. We computed $\mathrm{ROUGE}_1$ scores, excluding stopwords, and computed $\mathrm{ROUGE}_2$ scores, keeping them. ?) suggested using $\mathrm{ROUGE}_2$ and keeping stopwords. However, as Takamura and Okumura argued [Takamura and Okumura, 2009], the summaries optimized with non-content words failed to consider the actual quality. Thus, we excluded stopwords for computing the $\mathrm{ROUGE}_1$ scores.
We enumerated the following two types of oracle summaries: those for a set of references for a given topic and those for each reference in the set of references.
6.2 Results and Discussion
6.2.1 Impact of Oracle $\mathrm{ROUGE}_n$ Scores
Table 2 shows the average $\mathrm{ROUGE}_{1,2}$ scores of the oracle summaries obtained from both a set of references and each reference in the set (“multi” and “single”), those of the best conventional system (Peer), and those obtained from summaries produced by a greedy algorithm (Algorithm 2).
Oracle (single) obtained better $\mathrm{ROUGE}_{1,2}$ scores than Oracle (multi). The results imply that it is easier to optimize a reference summary than a set of reference summaries. On the other hand, the $\mathrm{ROUGE}_{1,2}$ scores of these oracle summaries are significantly higher than those of the best systems. Relative to their oracle summaries, the best systems obtained $\mathrm{ROUGE}_1$ scores from 60% to 70% in “multi” and from 50% to 60% in “single” as well as $\mathrm{ROUGE}_2$ scores from 40% to 55% in “multi” and from 30% to 40% in “single.”
Since the systems in Table 2 were developed over many years, we compared the $\mathrm{ROUGE}_n$ scores of the oracle summaries with those of the current state-of-the-art systems using the DUC-2004 corpus and obtained summaries generated by different systems from a public repository (http://www.cis.upenn.edu/~nlp/corpora/sumrepo.html) [Hong et al., 2014]. The repository includes summaries produced by the following seven state-of-the-art summarization systems: CLASSY04 [Conroy et al., 2004], CLASSY11 [Conroy et al., 2011], Submodular [Lin and Bilmes, 2012], DPP [Kulesza and Tasker, 2011], RegSum [Hong and Nenkova, 2014], OCCAMS_V [Davis et al., 2012, Conroy et al., 2013], and ICSISumm [Gillick and Favre, 2009, Gillick et al., 2009]. Table 3 shows the results.
Based on the results, RegSum [Hong and Nenkova, 2014] achieved the best $\mathrm{ROUGE}_1$ result, while ICSISumm [Gillick and Favre, 2009, Gillick et al., 2009] (a compressive summarizer) achieved the best result with $\mathrm{ROUGE}_2$. These systems outperformed the best systems (Peers 65 and 67 in Table 2), but the differences in the $\mathrm{ROUGE}_n$ scores between the systems and the oracle summaries are still large. More recently, ?) demonstrated that their system combination approach achieved the current best $\mathrm{ROUGE}_2$ score, 0.105, for the DUC-2004 corpus. However, a large difference remains between the $\mathrm{ROUGE}_2$ scores of the oracle summaries and their summaries.
In short, the scores of the oracle summaries are significantly higher than those of the current state-of-the-art summarization systems, both extractive and compressive summarization. These results imply that further improvement of the performance of extractive summarization is possible.
On the other hand, the $\mathrm{ROUGE}_{1,2}$ scores of the oracle summaries are far from $\mathrm{ROUGE}_n = 1$. We believe that the results are related to the summary’s compression rate. The data set’s compression rate was only 1 to 2%. Thus, under tight length constraints, extractive summarization basically fails to cover large numbers of n-grams in the reference summary. This reveals the limitation of the extractive summarization paradigm and suggests that we need another direction, compressive or abstractive summarization, to overcome the limitation.
6.2.2 Rouge Scores of Summaries Obtained from Greedy Algorithm
Table 2 also shows the $\mathrm{ROUGE}_{1,2}$ scores of the summaries obtained from the greedy algorithm (greedy summaries). Although there are statistically significant differences between the $\mathrm{ROUGE}$ scores of the oracle summaries and greedy summaries, the greedy summaries achieved near-optimal scores, i.e., their approximation ratios are close to 0.9. These results are surprising since the algorithm’s theoretical lower bound is $\frac{1}{2}(1 - \frac{1}{e})\,\mathrm{OPT}$ (about 0.32 OPT).
On the other hand, the results do not support that the differences between them are small at the sentence-level. Table 4 shows the average Jaccard Index between the oracle summaries and the corresponding greedy summaries for the DUC-2004 corpus. The results demonstrate that the oracle summaries are much less similar to the greedy summaries at the sentence-level. Thus, it might not be appropriate to use greedy summaries as training data for learning-based extractive summarization systems.
6.2.3 Impact of Enumeration
Table 5 shows the median number of oracle summaries and the rates of the reference summaries that have multiple oracle summaries for each data set. Over 80% of the reference summaries and about 60% to 90% of the topics have multiple oracle summaries. Since the $\mathrm{ROUGE}_n$ scores are based on the unweighted counting of n-grams, when many sentences have similar meanings, i.e., many redundant sentences, the number of oracle summaries that have the same $\mathrm{ROUGE}_n$ scores increases. The source documents of multiple document summarization tasks are prone to have many such redundant sentences, so the number of oracle summaries is large.
The oracle summaries offer significant benefit with respect to evaluating the extracted sentences. Since both the oracle and system summaries are sets of sentences, it is easy to check whether each sentence in the system summary is contained in one of the oracle summaries. Thus, we can exploit the F-measures, which are useful for evaluating classification-based extractive summarization [Mani and Bloedorn, 1998, Osborne, 2002, Hirao et al., 2002]. Here, we have to consider that the oracle summaries, obtained from a reference summary or a set of reference summaries, are not identical at the sentence-level (e.g., the average Jaccard Index between the oracle summaries for the DUC-2004 corpus is around 0.5). The F-measures vary with the oracle summaries that are used for such computation. For example, assume that we have system summary $S$ and oracle summaries $O_1$ and $O_2$. The precision for $O_1$ is 0.5, while that for $O_2$ is 0.75; the recall for $O_1$ is 0.5, while that for $O_2$ is 1; the F-measure for $O_1$ is 0.5, while that for $O_2$ is 0.86.
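Thus, we employ the scores gained by averaging all of the oracle summaries as evaluation measures. With $O_{\mathrm{all}}$ denoting the set of all oracle summaries and $S$ the system summary, precision, recall, and F-measure are defined as follows: $P = \frac{1}{|O_{\mathrm{all}}|} \sum_{O \in O_{\mathrm{all}}} \frac{|O \cap S|}{|S|}$, $R = \frac{1}{|O_{\mathrm{all}}|} \sum_{O \in O_{\mathrm{all}}} \frac{|O \cap S|}{|O|}$, $\text{F-measure} = \frac{2PR}{P + R}$.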
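As a small illustration of averaging over oracle summaries, the sketch below computes precision, recall, and F-measure for a system summary against a list of oracle summaries, each treated as a set of sentence identifiers. The function name `averaged_prf` and the concrete sets are hypothetical; the sets were chosen only so that the per-oracle precision and recall values match the ones quoted in the example (0.5 and 0.75 for precision, 0.5 and 1 for recall).

```python
def averaged_prf(system, oracles):
    """Precision, recall and F-measure averaged over all oracle summaries."""
    system = set(system)
    # Average precision and recall over every oracle summary.
    p = sum(len(system & set(o)) / len(system) for o in oracles) / len(oracles)
    r = sum(len(system & set(o)) / len(o) for o in oracles) / len(oracles)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical sentence identifiers.
system = {"s1", "s2", "s3", "s4"}
oracle_1 = {"s1", "s2", "s5", "s6"}   # precision 0.5,  recall 0.5
oracle_2 = {"s1", "s2", "s3"}         # precision 0.75, recall 1.0
print(averaged_prf(system, [oracle_1, oracle_2]))
```

Evaluating against only one of the two oracle summaries would give noticeably different values, which is why the average over all enumerated oracle summaries is used.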
To demonstrate F-measure’s effectiveness, we investigated the correlation between an F-measure and human judgment based on the evaluation results obtained from the DUC-2004 corpus. The results include summaries generated by 17 systems, each of which has a mean coverage score assigned by a human subject. We computed the correlation coefficients between the average F-measure and the average mean coverage score for 50 topics. Table 6 shows Pearson’s $r$ and Spearman’s $\rho$. In the table, “F-measure (R1)” and “F-measure (R2)” indicate the F-measures calculated using oracle summaries optimized to $\mathrm{ROUGE}_1$ and $\mathrm{ROUGE}_2$, respectively. “M” indicates the F-measure calculated using multiple oracle summaries, and “S” indicates F-measures calculated using randomly selected oracle summaries. “multi” indicates oracle summaries obtained from a set of references, and “single” indicates oracle summaries obtained from a reference summary in the set. For “S,” we randomly selected a single oracle summary and calculated the F-measure 100 times and took the average value with the 95% confidence interval of the F-measures by bootstrap resampling.
The results demonstrate that the F-measures are strongly correlated with human judgment. Their values are comparable with those of $\mathrm{ROUGE}_{1,2}$. In particular, F-measure (R1) (single-M) achieved the best Spearman’s $\rho$ result. When comparing “single” with “multi,” Pearson’s $r$ of “multi” was slightly lower than that of “single,” and the Spearman’s $\rho$ of “multi” was almost the same as that of “single.” “M” has significantly better performance than “S.” These results imply that F-measures based on oracle summaries are a good evaluation measure and that oracle summaries have the potential to be an alternative to human-made reference summaries in terms of automatic evaluation. Moreover, the enumeration of the oracle summaries for a given reference summary or a set of reference summaries is essential for automatic evaluation.
6.2.4 Search Efficiency
To demonstrate the efficiency of our search algorithm against the naive exhaustive search method, we compared the number of feasible solutions (sets of sentences that satisfy the length constraint) with the number of summaries that were checked in our search algorithm. The algorithm that counts the number of feasible solutions is shown in Appendix B.
Table 7 shows the median number of feasible solutions and checked summaries yielded by our method for each data set (in the case of “single”). The differences in the number of feasible solutions between and are very large. Input set () of is much larger than . On the other hand, the differences between and in our method are of the order of to . When comparing our method with naive exhaustive searches, its search space is significantly smaller. The differences are of the order of to with and to with . These results demonstrate the efficiency of our branch and bound technique.
In addition, we show an example of the processing time for extracting one oracle summary and enumerating all of the oracle summaries for the reference summaries in the DUC-2004 corpus with a Linux machine (CPU: Intel® Xeon® X5675 (3.07GHz)) with 192 GB of RAM. We utilized CPLEX 12.1 to solve the ILP problem. Our algorithm was implemented in C++ and compiled with GCC version 4.4.7. The results show that we needed 0.026 and 0.021 sec. to extract one oracle summary per reference summary and 0.047 and 0.031 sec. to extract one oracle summary per set of reference summaries for $\mathrm{ROUGE}_1$ and $\mathrm{ROUGE}_2$, respectively. We needed 11.90 and 1.40 sec. to enumerate the oracle summaries per reference summary and 102.94 and 3.65 sec. per set of reference summaries for $\mathrm{ROUGE}_1$ and $\mathrm{ROUGE}_2$, respectively. The extraction of one oracle summary for a reference summary can be achieved with the ILP solver in practical time and the enumeration of oracle summaries is also efficient. However, to enumerate oracle summaries, we needed several weeks for some topics in DUCs 2005 to 2007 since they hold a huge number of source sentences.
7 Conclusions
To analyze the limitations and the future direction of extractive summarization, this paper proposed (1) an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of $\mathrm{ROUGE}_n$ scores and (2) an algorithm that enumerates all oracle summaries to exploit F-measures that evaluate the sentences extracted by systems.
The evaluation results obtained from the corpora of DUCs 2001 to 2007 identified the following: (1) room still exists to improve the $\mathrm{ROUGE}_n$ scores of extractive summarization systems even though the $\mathrm{ROUGE}_n$ scores of the oracle summaries fell below the theoretical upper bound $\mathrm{ROUGE}_n = 1$. (2) Over 80% of the reference summaries and from 60% to 90% of the sets of reference summaries have multiple oracle summaries, and the F-measures computed by utilizing the enumerated oracle summaries showed stronger correlation with human judgment than those computed from single oracle summaries.
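Appendix A.
Proof.
We can rewrite the right side of equation (9) as follows:
$$\mathrm{ROUGE}_n(\mathbf{R}, V) + \mathrm{ROUGE}'_n(\mathbf{R}, V, W) = \frac{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} Z(g_j^n, \mathcal{R}_k, \mathcal{V}, \mathcal{W})}{\sum_{k=1}^{|\mathbf{R}|} \sum_{j=1}^{|U(\mathcal{R}_k)|} N(g_j^n, \mathcal{R}_k)} \tag{13}$$
Here, $Z$ is defined as follows:
$$Z(g_j^n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(g_j^n, \mathcal{R}_k),\, N(g_j^n, \mathcal{V})\} + \min\{N(g_j^n, \mathcal{R}_k \setminus \mathcal{V}),\, N(g_j^n, \mathcal{W})\} \tag{14}$$
$N(g_j^n, \mathcal{R}_k \setminus \mathcal{V})$ is the number of times $g_j^n$ occurs in the multiple set $\mathcal{R}_k \setminus \mathcal{V}$, i.e., $\max\{N(g_j^n, \mathcal{R}_k) - N(g_j^n, \mathcal{V}),\, 0\}$. Equation (14) is rewritten as
$$Z(g_j^n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(g_j^n, \mathcal{R}_k),\, N(g_j^n, \mathcal{V})\} + \min\{\max\{N(g_j^n, \mathcal{R}_k) - N(g_j^n, \mathcal{V}),\, 0\},\, N(g_j^n, \mathcal{W})\} \tag{15}$$
The solutions of equation (15) are obtained by considering the following three conditions:
1. If $N(g_j^n, \mathcal{R}_k) \ge N(g_j^n, \mathcal{V})$ and $N(g_j^n, \mathcal{R}_k) - N(g_j^n, \mathcal{V}) \ge N(g_j^n, \mathcal{W})$, then $Z = N(g_j^n, \mathcal{V}) + N(g_j^n, \mathcal{W})$.
2. If $N(g_j^n, \mathcal{R}_k) \ge N(g_j^n, \mathcal{V})$ and $N(g_j^n, \mathcal{R}_k) - N(g_j^n, \mathcal{V}) < N(g_j^n, \mathcal{W})$, then $Z = N(g_j^n, \mathcal{R}_k)$.
3. If $N(g_j^n, \mathcal{R}_k) < N(g_j^n, \mathcal{V})$, then $Z = N(g_j^n, \mathcal{R}_k)$.
From the above relations,
$$Z(g_j^n, \mathcal{R}_k, \mathcal{V}, \mathcal{W}) = \min\{N(g_j^n, \mathcal{R}_k),\, N(g_j^n, \mathcal{V}) + N(g_j^n, \mathcal{W})\} \tag{16}$$
Thus, since $V$ and $W$ are disjoint and the n-gram counts of $V \cup W$ are therefore the sums of those of $V$ and $W$,
$$\mathrm{ROUGE}_n(\mathbf{R}, V) + \mathrm{ROUGE}'_n(\mathbf{R}, V, W) = \mathrm{ROUGE}_n(\mathbf{R}, V \cup W) \tag{17}$$
∎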
Appendix B.
We propose an algorithm to compute the number of feasible solutions under the length constraint by extending the dynamic programming based approach for the subset sum problem [Cormen et al., 2009]. We define $T[i][l]$, which stores the number of feasible solutions of length $l$ (with $l$ at most $L_{\max}$) that can be obtained from set $\{s_1, \ldots, s_i\}$, as follows:
- Initialization
$$T[0][0] = 1, \qquad T[0][l] = 0 \quad (1 \le l \le L_{\max})$$
- Recurrence ($1 \le i \le |D|$)
$$T[i][l] = \begin{cases} T[i-1][l] + T[i-1][l - \ell(s_i)] & \text{if } l \ge \ell(s_i) \\ T[i-1][l] & \text{otherwise} \end{cases}$$
Algorithm 4 is a dynamic program that fills out the $T$ ($|D| \times L_{\max}$) table. After the table is filled, each cell on the $|D|$-th line stores the number of feasible solutions of the corresponding length. In the algorithm, first, we pick up the sentences that contain an n-gram that appears in the reference summary at least once and recursively count the number of feasible solutions. Then, the sum of the $|D|$-th line whose index is from 1 to $L_{\max}$ indicates the number of feasible solutions. The order of the algorithm is $O(|D| L_{\max})$.
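The counting procedure can be sketched with the standard subset-sum style dynamic program. This is an assumption-laden illustration rather than the paper's Algorithm 4: the name `count_feasible` is hypothetical, sentence lengths are assumed to be positive integers, and a single rolling row is used in place of the full table to keep the code short.

```python
def count_feasible(lengths, l_max):
    """Count non-empty sentence subsets whose total length is at most l_max."""
    # row[l] = number of subsets of the sentences seen so far with total length exactly l
    row = [0] * (l_max + 1)
    row[0] = 1  # the empty set
    for length in lengths:
        # Go through lengths in descending order so each sentence is used at most once.
        for l in range(l_max, length - 1, -1):
            row[l] += row[l - length]
    # Sum the cells with index 1..l_max; index 0 (the empty set) is excluded.
    return sum(row[1:])

# Hypothetical sentence lengths: feasible subsets are {0}, {1}, {2} and {0, 1}.
print(count_feasible([2, 2, 3], l_max=4))
```

The running time is proportional to the number of sentences times the length limit, which is what makes counting feasible even when enumerating the subsets themselves is not.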
References
- [Almeida and Martins, 2013] Miguel B. Almeida and André F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics, pages 196–206.
- [Banerjee et al., 2015] Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-document abstractive summarization using ILP based multi-sentence compression. In Proc. of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 1208–1214.
- [Bing et al., 2015] Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1587–1597.
- [Ceylan et al., 2010] Hakan Ceylan, Rada Mihalcea, Umut Özertem, Elena Lloret, and Manuel Palomar. 2010. Quantifying the limits and success of extractive summarization systems across domains. In Proc. of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 903–911.
- [Conroy et al., 2004] John M. Conroy, Jade Goldstein, Judith D. Schlesinger, and Dianne P. O’Leary. 2004. Left-brain/right-brain multi-document summarization. In Proc. of the Document Understanding Conference (DUC).
- [Conroy et al., 2011] John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O’Leary. 2011. CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proc. of the Text Analysis Conference (TAC).
- [Conroy et al., 2013] John M. Conroy, Sashka T. Davis, Jeff Kubina, Yi-Kai Liu, Dianne P. O’Leary, and Judith D. Schlesinger. 2013. Multilingual summarization: Dimensionality reduction and a step towards optimal term coverage. In Proc. of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization, pages 55–63.
- [Cormen et al., 2009] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms. The MIT Press, 3rd edition.
