Scientific document summarization via citation contextualization and scientific discourse

Arman Cohan; Nazli Goharian

arXiv:1706.03449·cs.CL·June 13, 2017

TL;DR

This paper introduces a novel framework for scientific document summarization that leverages citation contextualization and discourse analysis to generate more accurate and informative summaries, significantly outperforming existing methods.

Contribution

It presents new methods for contextualizing citations and identifying discourse facets, enhancing scientific summarization accuracy across biomedical and computational linguistics domains.

Findings

01 Improved summarization performance over state-of-the-art methods

02 Effective citation contextualization using query reformulation, embeddings, and supervised learning

03 Enhanced understanding of scientific discourse structure

Abstract

The rapid growth of scientific literature has made it difficult for the researchers to quickly learn about the developments in their respective fields. Scientific document summarization addresses this challenge by providing summaries of the important contributions of scientific papers. We present a framework for scientific summarization which takes advantage of the citations and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are even sometimes inaccurate. We first address the problem of inaccuracy of the citation texts by finding the relevant context from the cited paper. We propose three approaches for contextualizing citations which are based on query reformulation, word embeddings, and supervised learning. We then train a model to identify the discourse facets for each citation. We finally propose a…

Tables (14)

Table 1: Example of similarity values between terms according to the dot product of their corresponding embeddings, using the Word2Vec model pre-trained on the Google News corpus. The top part of the table shows pairs of random words, while the bottom part shows similarity values for pairs of related words.

word 1	word 2	Similarity
marker	mint	0.11
notebook	sky	0.07
capture	promotion	0.12
blue	sky	0.31
produce	make	0.43

Table 2: Features for identifying discourse facets.

Feature Name
Citation Text
Extracted Reference Context
Verb Features
Relative Section Position

Table 3: Characteristics of the datasets. #: number of, Avg: average, and Stdev: standard deviation.

Characteristic	TAC	CL-SciSum
# Documents	220	506
# Reference Documents	20	30
Avg. # Citing Docs for each Ref	15.5	15.9
Total # Citation Texts	313	702
Avg. Gold summary length (words)	235.6	134.2
Stdev. Gold summary length (words)	31.2	27.9
Separate train test sets	No	Yes

Table 4: Results of citation contextualization on the TAC 2014 dataset. The reported results are based on the top 10 retrieved contexts. The top part shows the baselines and the bottom part shows our proposed models. Values are percentages. QR-Domain: Query Reformulation by Domain Ontology (UMLS), QR-NP: Query Reformulation by Noun Phrases, QR-KW: Query Reformulation by Key Words, $\mathrm{WE}_{wiki}$: Word Embedding model with Wikipedia embeddings, $\mathrm{WE}_{Bio}$: Word Embedding model with biomedical embeddings, $\mathrm{WE}_{Bio+Retrofit}$: Incorporating domain knowledge in biomedical embeddings by retrofitting, $\mathrm{WE}_{Bio}+Domain$: Interpolated language model.

	Character offset overlap			Rouge
Method	$P_{char}$	$R_{char}$	$F_{char}$	Rouge-2	Rouge-3
Baselines
BM25 robertson2009probabilistic	19.5	18.6	17.8	23.2	16.3
VSM	20.5	24.7	21.2	26.4	20.0
LMD zhai2004study	21.3	26.7	22.3	27.2	20.8
LMD + LDA jian2016simple	22.6	24.8	22.3	26.4	20.1
This work
QR-Domain	24.1	23.7	21.8	25.0	20.8
QR-NP	22.6	28.9	23.8	28.0	21.8
QR-KW	22.6	29.4	24.1	28.2	22.2
$\mathrm{WE}_{wiki}$	21.8	28.5	23.2	26.9	20.9
$\mathrm{WE}_{Bio}$	23.9	31.2	25.5	29.2	23.1
$\mathrm{WE}_{Bio+Retrofit}$	24.8	33.6	26.4	30.7	24.0
$\mathrm{WE}_{Bio}+Domain$	25.4	33.0	27.0	30.6	24.4
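The character offset overlap columns follow the $P_{char}$ and $R_{char}$ definitions listed in the Equations section. The sketch below is one possible reading of those definitions, not the authors' evaluation code: it assumes that $S$ is the set of character positions covered by the retrieved context, that each $R_i$ is the set of positions marked by annotator $i$, and that $m$ is the number of annotators. The function names and the half-open span convention are illustrative choices.

```python
def span_to_offsets(start, end):
    """Convert a half-open character span [start, end) into a set of positions."""
    return set(range(start, end))


def char_overlap_scores(system_offsets, annotator_offsets):
    """Return (precision, recall) by character offset overlap.

    system_offsets: set of character positions in the retrieved context (S).
    annotator_offsets: list of sets, one per annotator (R_1 .. R_m).
    Assumption: empty inputs yield a score of 0.0 rather than an error.
    """
    m = len(annotator_offsets)
    # Numerator shared by both metrics: sum over annotators of |S ∩ R_i|.
    overlap = sum(len(system_offsets & gold) for gold in annotator_offsets)

    precision_denominator = m * len(system_offsets)
    recall_denominator = sum(len(gold) for gold in annotator_offsets)

    precision = overlap / precision_denominator if precision_denominator else 0.0
    recall = overlap / recall_denominator if recall_denominator else 0.0
    return precision, recall


# Hypothetical example: one retrieved span and two annotators.
system = span_to_offsets(10, 30)
gold = [span_to_offsets(20, 40), span_to_offsets(0, 15)]
precision, recall = char_overlap_scores(system, gold)
```

Working on sets of offsets keeps the intersection sizes explicit and handles retrieved contexts made of several disjoint spans without extra bookkeeping.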

Table 5: Results of citation contextualization on the CL-SciSum 2016 dataset. The reported values are percentages. The top part shows the baselines and state-of-the-art models, while the bottom part shows our methods. P: Precision, R: Recall, F: F1-score. The “sent” subscript shows overlap by sentences and the “char” subscript shows character offset overlaps. QR-NP: Query Reformulation by Noun Phrases, QR-KW: Query Reformulation by Key Words, $\mathrm{WE}_{wiki}$: Word Embedding model with Wikipedia embeddings, $\mathrm{WE}_{wiki+Retrofit}$: Incorporating domain knowledge in embeddings by retrofitting.

	Sentence overlap			Rouge		Character offset overlap
Method	$P_{sent}$	$R_{sent}$	$F_{sent}$	Rouge-2	Rouge-3	$P_{char}$	$R_{char}$	$F_{char}$
Other methods
BM25 robertson2009probabilistic	8.2	18.0	10.5	15.2	13.0	9.0	19.9	11.8
VSM	8.3	22.3	11.6	14.8	12.7	8.5	25.7	12.1
LM zhai2004study	7.9	24.8	11.6	14.3	12.6	8.4	26.1	12.2
TSR klampfl2016identifying	5.3	4.7	5.0	-	-	-	-	-
Tf-idf + Neural Net Nomoto2016NEALAN	9.2	11.1	10.0	-	-	-	-	-
SVM Rank Cao2016PolyUAC	8.8	13.1	10.3	-	-	-	-	-
Jaccard Fusion Li2016CISTSF	8.3	26.1	12.5	-	-	-	-	-
Tf-idf+stem moraes2016university	9.6	22.4	13.4	-	-	-	-	-
This work
QR-NP	8.8	20.4	12.2	15.8	13.6	9.7	23.8	13.2
QR-KW	9.0	21.3	12.6	16.0	13.8	9.6	23.3	13.0
$\mathrm{WE}_{wiki}$	9.8	24.1	13.9	14.5	12.5	9.4	22.1	12.5
$\mathrm{WE}_{wiki+Retrofit}$	9.8	23.8	13.8	14.7	13.6	8.2	22.3	12.0
Supervised	11.3	17.8	13.7	17.5	15.0	12.0	17.8	13.7

Table 6: The words with highest similarity values to “expression” according to Word2Vec trained on Wikipedia (general domain) and Genomics collections (biomedical domain).

General (Wiki)	Domain-specific (Bio)
interpretation	upregulation
sense	mrna
emotion	protein
function	induction
show	cell

Table 7: The weights (normalized) corresponding to the top features in the supervised method for citation contextualization (CL-SciSum dataset). Tf-idf similarity based features and embedding based features are the most helpful, while the count based similarity and word matching features are among the least helpful features.

Feature	weight
character n-gram tf-idf similarity	0.271
tf-idf similarity	0.201
embedding based alignment	0.189
distance average embeddings	0.106
bm25 similarity score	0.066
character n-gram count similarity	0.035
fuzzy word match	0.024
count based similarity	0.015
word match	0.013

Table 8: The number of citations grouped by the number of annotators that agree at least partially on the context.

Number of Citations	Number of Annotators with at least partial agreement
68	4
66	3
121	2
11	No agreement

Table 9: Results for identifying the discourse facets for the retrieved contexts. The metrics are Precision (P), Recall (R), and F1-score (F) of the identified discourse facets contingent on the correct retrieved span.

Method	P	R	F
Other methods
SMO saggion2016trainable	35.6	3.6	6.5
Decision tree Cao2016PolyUAC	59.7	9.0	15.3
Fusion method Li2016CISTSF	52.8	22.4	29.6
Jaccard cascade Li2016CISTSF	58.2	17.1	25.5
Jaccard Focused Method Li2016CISTSF	57.8	22.8	31.1
This work
QR-NP	76.3	19.1	29.7
QR-KW	78.7	21.9	33.3
WE_wiki	82.7	22.4	33.1
WE_wiki+retro	81.7	23.4	34.8
Supervised	83.1	23.7	36.1

Table 10: The classifier’s intrinsic performance for identifying the discourse facets on the CL-SciSum dataset.

Discourse Facet	P	R	F	#
Aim	0.93	0.36	0.52	36
Hypothesis	1.00	0.20	0.33	10
Implication	0.85	0.26	0.39	43
Method	0.79	0.98	0.87	250
Results	0.85	0.38	0.52	45
Average/Total	0.82	0.75	0.73	384

Table 11: Effect of learning algorithms in identifying the discourse facets. SVM: Support Vector Machine with Linear Kernel, RF: Random Forest, LR: Logistic Regression, Oracle: Highest achievable score. Numbers are weighted accuracy scores by annotators.

	SVM	RF	LR	Oracle
TAC	0.53	0.49	0.51	0.67
CL-SciSum	0.67	0.64	0.66	-

Table 12: Summarization results on the CL-SciSum dataset. Metrics are Rouge F-scores. The top part shows the baselines and the state-of-the-art systems. The bottom part shows our method variants based on different contextualization approaches and sentence selection strategies from the discourse facets. iter (iterative) and greedy refer to the sentence selection approach for the final summary.

	Rouge-2	Rouge-3	Rouge-SU4
LexRank erkan2004	11.8	8.1	11.4
CLexRank qazvinian2008scientific	5.7	3.3	8.9
SumBasic vanderwende2007beyond	8.5	3.8	11.5
SUMMA saggion2016trainable	13.4	-	9.2
LMKL conroy2015vector	19.0	-	11.1
LMeq conroy2015vector	18.9	-	12.4
CIST Li2016CISTSF	21.9	-	13.6
QR-KW-iter	27.6	21.4	23.4
QR-KW-greedy	28.9	22.5	24.9
QR-NP-iter	23.0	20.9	22.6
QR-NP-greedy	30.2	23.9	25.7
WE_wiki-iter	22.4	15.9	21.7
WE_wiki-greedy	23.6	18.0	20.1
supervised-iter	24.1	18.5	20.8
supervised-greedy	23.6	18.3	19.6

Table 13: Summarization results on the TAC dataset. Metrics are Rouge F-scores. The top part shows the baselines and the state-of-the-art systems. The bottom part shows our method variants based on different contextualization approaches and the greedy sentence selection strategy.

	Rouge-2	Rouge-3	Rouge-SU4
LexRank erkan2004	12.8	5.0	17.5
CLexRank qazvinian2008scientific	8.9	3.9	8.3
SumBasic vanderwende2007beyond	8.3	4.2	12.5
QR-NP	15.8	6.9	20.4
QR-Domain	13.2	5.2	18.1
QR-KW	15.0	6.6	19.8
WE_wiki	13.3	5.5	17.8
WE_Bio	13.1	4.9	18.0
WE_Bio+Retrofit	14.4	5.7	19.5
WE_Bio+Domain	13.4	5.9	20.7

Table 14: The effect of discourse facets on the summarization results on the TAC and CL-SciSum datasets, based on the QR-NP approach with the greedy sentence selection strategy on the identified facets. Other approaches show similar positive trends. Metrics are Rouge F-scores.

	R-2	R-3	R-SU4
TAC – QR-NP (no facet)	13.5	5.3	19.3
TAC – QR-NP (faceted)	15.8	6.9	20.4
CL-SciSum – QR-NP (no facet)	19.4	17.2	22.6
CL-SciSum – QR-NP (faceted)	30.2	23.9	25.7

Equations

$p(d \mid q) \propto p(q \mid d) = \prod_{i=1}^{n} p(q_{i} \mid d)$

$p(q_{i} \mid d) = \frac{f(q_{i}, d) + \mu \, p(q_{i} \mid C)}{\sum_{w \in V} f(w, d) + \mu}$

$p(q_{i} \mid d) = \frac{\sum_{d_{j} \in d} s(q_{i}, d_{j}) + \mu \, p(q_{i} \mid C)}{\sum_{w \in V} \sum_{d_{j} \in d} s(w, d_{j}) + \mu}$

$s(q_{i}, d_{j}) = \begin{cases} \phi\big(e(q_{i}), e(d_{j})\big), & \text{if } e(q_{i}) \cdot e(d_{j}) > \tau \\ 0, & \text{otherwise} \end{cases}$

$\phi(x) = \log\left(\frac{x}{1 - x}\right)$

$p(q_{i} \mid d) = \lambda \, p_{1}(q_{i} \mid d) + (1 - \lambda) \, p_{2}(q_{i} \mid d)$

$s_{2}(q_{i}, d_{j}) = \begin{cases} 1, & \text{if } q_{i} = d_{j} \\ \gamma, & \text{if } q_{i} \text{ is-syn } d_{j} \\ 0, & \text{o.w.} \end{cases}$

$f(S_{1}, S_{2}) = \frac{\sum_{w \in S_{1}} \max_{v \in S_{2}} s(w, v)}{|S_{1}|}$

$P_{char} = \frac{\sum_{i}^{m} |S \cap R_{i}|}{m \times |S|}$

$R_{char} = \frac{\sum_{i}^{m} |S \cap R_{i}|}{\sum_{i}^{m} |R_{i}|}$
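As an illustration of the first two equations in this list, the sketch below scores a reference span against a citation text with a Dirichlet-smoothed query likelihood, computed in log space to avoid underflow. It is a minimal sketch under stated assumptions rather than the authors' implementation: $f(q_{i}, d)$ is taken to be the raw term count in the span, $p(q_{i} \mid C)$ the relative frequency of the term in the whole reference article, and the default value of `mu` is an arbitrary placeholder. The embedding-based similarity $s(q_{i}, d_{j})$ of the later equations is not implemented here.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_terms, collection_terms, mu=1000.0):
    """Return log p(q | d) under a Dirichlet-smoothed unigram language model.

    query_terms: tokens of the citation text (q).
    doc_terms: tokens of one reference span (d).
    collection_terms: tokens of all spans, used for the background model (C).
    mu: smoothing parameter; the default is a placeholder, not a tuned value.
    """
    doc_counts = Counter(doc_terms)
    collection_counts = Counter(collection_terms)
    collection_length = len(collection_terms)
    # The sum over the vocabulary of f(w, d) is simply the span length.
    doc_length = len(doc_terms)

    log_probability = 0.0
    for term in query_terms:
        if collection_length:
            background = collection_counts[term] / collection_length
        else:
            background = 0.0
        term_probability = (doc_counts[term] + mu * background) / (doc_length + mu)
        if term_probability == 0.0:
            # The term occurs nowhere in the collection, so the product is zero.
            return float("-inf")
        log_probability += math.log(term_probability)
    return log_probability


# Hypothetical toy data: two reference spans and a two-term query.
span_a = "gene expression is regulated by mrna levels".split()
span_b = "the cell cycle is controlled by protein kinases".split()
collection = span_a + span_b
query = ["gene", "expression"]
ranked = sorted(
    [("span_a", span_a), ("span_b", span_b)],
    key=lambda item: dirichlet_log_likelihood(query, item[1], collection, mu=10.0),
    reverse=True,
)
```

Because the document prior is uniform, ranking spans by $\log p(q \mid d)$ gives the same order as ranking by $p(d \mid q)$.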

Full text

1 Information Retrieval Lab, Department of Computer Science, Georgetown University, Washington DC, USA

Scientific document summarization via citation contextualization and scientific discourse ∗

∗ This is a pre-print of an article published in IJDL. The final publication is available at Springer via http://dx.doi.org/10.1007/s00799-017-0216-8

Arman Cohan 1

Nazli Goharian 1


Abstract

The rapid growth of scientific literature has made it difficult for the researchers to quickly learn about the developments in their respective fields. Scientific document summarization addresses this challenge by providing summaries of the important contributions of scientific papers. We present a framework for scientific summarization which takes advantage of the citations and the scientific discourse structure. Citation texts often lack the evidence and context to support the content of the cited paper and are even sometimes inaccurate. We first address the problem of inaccuracy of the citation texts by finding the relevant context from the cited paper. We propose three approaches for contextualizing citations which are based on query reformulation, word embeddings, and supervised learning. We then train a model to identify the discourse facets for each citation. We finally propose a method for summarizing scientific papers by leveraging the faceted citations and their corresponding contexts. We evaluate our proposed method on two scientific summarization datasets in the biomedical and computational linguistics domains. Extensive evaluation results show that our methods can improve over the state of the art by large margins.

Journal: International Journal on Digital Libraries

1 Introduction

The rapid growth of scientific literature in recent decades has made it challenging for researchers in various fields to keep up with the newest developments. According to a recent bibliometric study, global scientific output doubles approximately every nine years bornmann2015growth , which underscores this challenge. The existence of surveys in different fields shows that an overview of key developments in a scientific area is desirable; however, producing such surveys requires extensive human effort. Scientific summarization aims to address this problem by providing a concise representation of the important findings and contributions of scientific papers, reducing the time needed to go through an entire paper to understand its important contributions. Article abstracts are a basic form of scientific summary. While abstracts provide an overview of the paper, they do not necessarily convey all the important contributions and impacts of the paper elkiss2008blind : (i) the authors might ascribe contributions to their papers that do not exist; (ii) some important contributions might not be included in the abstract; (iii) the contributions stated in the abstract do not convey the article’s impact over time; (iv) abstracts usually provide a very broad view of the paper and may not be detailed enough for readers seeking specific contributions; (v) the content of abstracts is not evenly drawn from the different sections of the paper atanassova2016composition . These problems have inspired another type of scientific summary, obtained by utilizing a set of citations referencing the original paper qazvinian2008scientific ; qazvinian2013generating . Each citation is often accompanied by a short description explaining the ideas, methods, results, or findings of the cited work. This short description is called the citation text or citance nakov2004citances . Therefore, a set of citation texts from different papers can provide an overview of the main ideas, methods, and contributions of the cited paper, and can thus form a summary of the referenced paper. These community-based summaries capture the important contributions of the paper, view the article from multiple aspects, and reflect the impact of the article on the community.

At the same time, there are multiple problems associated with citation texts. They are written by different authors, so they may be biased toward another work. Citation texts lack context regarding the details of the methods, the data, the assumptions, and the results. More importantly, the points and claims of the original paper might be misunderstood by the citing authors; certain contributions might be ascribed to the cited work that do not match the original author’s intent. Another serious problem is the modification of the epistemic value of claims: many claims that the original author presents tentatively are stated as facts in later citations deWaard2012epistemic . An example of this is shown in Figure 1. As illustrated, while the original authors write about some possibilities, the citing authors later state them as known facts. These problems are even more serious in the biomedical domain, where slight misrepresentations of specific findings about treatments, diagnoses, and medications could directly affect human lives.

One way to address such problems is to consider the citations in their context from the reference article. Therefore, citation texts should be linked to the specific parts in the reference paper that correctly reflect them. We call this “citation contextualization”. Citation contextualization is a challenging task due to the terminology variations between the citing and cited author’s language usage.

Scientific papers have the unique characteristic of following a specific discourse structure. For example, a typical scientific discourse structure follows this form: problem and motivation, methods, experiments, results, and implications. The rhetorical status of a citation provides additional useful information that can be used in applications such as information extraction, retrieval, and summarization Teufel2006 . Each citation text could refer to specific discourse facets of the referenced paper. For example, one citation could be about the main method of the referenced paper while another could mention its results. Identifying these discourse facets is valuable for scientific summarization: it allows creating more coherent summaries and diversifying the points included in the generated summaries.

Scientific summarization has recently been further motivated by the TAC 2014 summarization track (Text Analysis Conference, http://tac.nist.gov/2014/BiomedSumm/) and the 2016 computational linguistics summarization shared task jaidka2016overview . Following these works and motivated by the challenges mentioned above, we propose a framework for scientific summarization based on citations. Our approach consists of the following steps:

• Contextualizing citation texts: We propose several approaches for contextualizing citations. Finding the exact reference context for a citation is challenging due to discourse variation and terminology differences between the citing and the referenced authors. Therefore, traditional Information Retrieval (IR) methods are inadequate for finding the relevant contexts. We propose to address this challenge with query reformulations, word embeddings bengio2003neural , and domain-specific knowledge. Our main approach is a retrieval model for finding the appropriate context of the citations, designed to handle terminology variations between the citing and cited authors.

• Discourse structure: After extracting the context of the citation texts, we classify them into different discourse facets. We use a linear classifier with a variety of features for classifying the citations.

• Summarization: We propose two approaches for summarizing the papers. Both approaches are based on summarization through the scientific community, where the main points of a paper are captured by a set of given citations. Our approach extends previous work on citation-based summarization Qazvinian2013 ; qazvinian2008scientific ; qazvinian2010identifying by including the reference context to address the inaccuracy problem associated with the citation texts. After extracting the citation contexts from the reference paper, we group them into different discourse facets. Then, using the most central sentences in each group, we generate the final summary.

In particular our contributions are summarized as follows: (i) An approach for extracting the context of the citation texts from the reference article. (ii) Identifying the discourse facets of the citation contexts. (iii) A scientific summarization framework utilizing citation contexts and the scientific discourse structure. (iv) Extensive evaluation on two scientific domains.

2 Related work

2.1 Citation text analysis

Citations play an integral role in the scientific development. They help disseminate the new findings and they allow new works to be grounded on previous efforts hernandez2016survey . While there is a large body of related work on analysis of citation networks, instead of link analysis, we focus on textual aspects of the citations. To better utilize the citations, researchers have explored ways to extract citation texts, which are short textual parts describing some aspects of the cited work. Examples of the proposed approaches for extracting the citation texts include jointly modeling the link information and the citation texts kataria2010utilizing , supervised Markov Random Fields classifiers qazvinian2010identifying , and sequence labeling with segment classification abu2012reference . These approaches focus on finding the sentences or textual spans in the citing article that explain some aspects of the cited work. In this work, we assume that citation texts are already obtained either manually or by using one of these works. Given the citation texts, we instead focus on contextualizing these citation texts using the reference; we find the text spans in the reference article that most closely reflect the citation text.

There is some related work on further analyzing citations to find their function or rhetorical status Teufel2006 ; garzone2000towards ; abu2013purpose ; hernandez2016survey . In these works, the authors tried to identify the reasons behind citations, which can be a statement of weakness, contrast or comparison, usage or compatibility, or a neutral category. They proposed a classification framework based on lexically and linguistically inspired features for classifying citation functions. The distribution of citations within the structure of scientific papers has also been studied Bertin2016 . The authors of chakraborty-narayanam:2016:EMNLP2016 investigated the problem of measuring the intensity of citations in scientific papers, and in chakraborty2016ferosa the authors proposed using discourse facets for scientific article recommendation. Recently, a framework for understanding citation function has been proposed jurgens2016citation which unifies the previous efforts in terms of the definition of citation functions. While citation function can provide additional information for summarization, in this work we do not utilize this information. Instead, we utilize the discourse facet of the citation contexts in a reference paper.

2.2 Citation contextualization

More recently, there have been some efforts in contextualizing citations from the reference. In particular, the TAC 2014 summarization track (http://tac.nist.gov/2014/BiomedSumm/) and the CL-SciSumm 2016 shared task on computational linguistics summarization jaidka2016overview have released datasets to promote research on citation contextualization. The former is more domain specific, focusing on biomedical scientific literature, while the latter is in a more general domain consisting of publications in computational linguistics. To our knowledge, there is no overview paper on TAC. We briefly discuss the successful approaches in CL-SciSumm 2016. The authors of Cao2016PolyUAC used an SVM-rank approach with features such as tf-idf (Term Frequency - Inverse Document Frequency) cosine similarity, position of the reference sentence, section position, and named entity features. In another approach Li2016CISTSF , the authors used an SVM classifier with sentence similarity and lexicon based features. The authors of Nomoto2016NEALAN proposed a hybrid model based on tf-idf similarity and a single layer neural network that scores the relevant reference texts above the irrelevant ones. Finally, in the work by klampfl2016identifying , the authors proposed the use of the TextSentenceRank algorithm, which is an enhanced version of the TextRank algorithm for ranking keywords in documents. Here, we specifically focus on the problem of terminology variation between the citing and cited authors and propose approaches that address it. Our proposed approaches are based on query reformulations, word embeddings, and domain-specific knowledge.

2.3 Text summarization

Document summarization has been an active research area in NLP in recent decades, and there is a rich literature on text summarization. Approaches to summarization can be divided into the following categories: (i) Topic modeling based gong2001lsa ; steinberger2004lsa ; vanderwende2007beyond ; celikyilmaz2010hybrid : in these approaches, the content or topical distribution of the final summary is estimated using a probabilistic framework. (ii) Solving an optimization problem Clarke:2008 ; berg2011jointly ; DurrettBergKlein2016 : these approaches cast summarization as an optimization problem where an objective function needs to be optimized with respect to some constraints. (iii) Supervised models osborne2002using ; conroy2011classy ; Chali:2012:QMS:2139643.2139649 : the selection of sentences for the summary is learned using a supervised framework. (iv) Graph based erkan2004 ; mihalcea2004 ; Paul:2010 : these approaches seek to find the most central sentences in a document’s graph, where sentences are nodes and edges are similarities. (v) Heuristic based carbonell1998use ; guo2010probabilistic ; lin2010putting : these works approach the summarization problem by greedy selection of the content. (vi) Neural networks: more recently, there have been some efforts on utilizing neural networks and sequence-to-sequence models sutskever2014sequence for generating summaries of short texts and sentences rush-chopra-weston:2015:EMNLP ; chopra-auli-rush:2016:N16-1 . Most of these works have focused on general domain summarization and news articles. Scientific articles differ substantially from news articles in elements such as length, language, complexity, and structure Teufel:2002 .

One of the first works in scientific article summarization is by Teufel:2002 , where the authors trained a supervised Naive Bayes classifier to select informative content for the summary. Later, the value of citations for generating scientific summaries was recognized elkiss2008blind . In the work by Qazvinian2013 , the authors proposed a clustering approach for citation-based summarization, while in abu2011coherent and jha2015surveyor the authors focused on producing coherent scientific summaries. We argue that citation texts by themselves are not always accurate and that they lack the context of the cited paper. Therefore, if we only use the citation texts for scientific summarization, the resulting summary would potentially suffer from the same problems, and it might not accurately reflect the claims made in the original paper. We address this problem by leveraging the citation contexts from the reference paper. We also utilize the inherent discourse structure of scientific documents to capture the important content from all sections of the paper.

We present a comprehensive framework for scientific summarization which utilizes and builds upon our earlier efforts Cohan2015 ; cohan-goharian:2015:EMNLP ; cohan2017contextualizing . We propose new approaches for citation contextualization. We further extend our experiments on an additional dataset (CL-SciSum 2016) and evaluate our approaches on both TAC and CL-SciSum datasets, providing detailed analysis.

3 Methodology

Our proposed method is a pipeline for summarizing scientific papers. It consists of the following steps:

1. citation contextualization (extracting the relevant context from the reference paper)

2. identifying the discourse facet of the extracted context

3. summarization

We first explain our proposed methods for contextualization, we then describe our approach for identifying discourse facets of the citation contexts, and finally we outline our summarization approach.

3.1 Citation contextualization

Citation contextualization refers to extracting the relevant context from the reference article for a given citation text. We propose the following three approaches for this problem: (i) Query reformulation, (ii) Word embeddings and domain knowledge, and (iii) Supervised classification.

3.1.1 Query reformulation (QR)

We cast the contextualization problem as an Information Retrieval (IR) task. We first extract textual spans from the reference article and index them using an IR model. The textual spans are at the granularity of sentences. In order to capture longer contexts (those consisting of multiple consecutive sentences), we also index sentence n-grams. That is, we index each run of n consecutive sentences as a separate text span (we indexed up to 3 consecutive sentences in our experiments). After constructing the index, we consider the citation text as the query and seek to find the relevant context among the indexed spans. Since citation texts are often longer than usual queries in standard IR tasks, we apply query reformulation methods to the citation to better retrieve the related context. We utilize both general and domain-specific query reformulations for this purpose. We first remove the citation markers (author names and year, and numbered citations) from the citations, as they do not appear in the reference text and hence are not helpful. We design several regular expressions to capture these markers. The proposed query reformulation (QR) methods are described below:

Query reduction

Since citation texts are usually more verbose than standard queries, they might contain many uninformative terms that do not contribute to finding the correct context. Hence, we apply query reduction methods to retain only the important concepts in the citation. After removing the stop words from the citation, we further experiment with the following three query reduction methods:

1. Noun phrases (QR-NP). Citation texts are usually linguistically well-formed, as they are extracted from scientific papers. This allows us to apply a variety of linguistic tagging and chunking methods to the query to capture the informative phrases. Previous works have shown that noun phrases are good representations of informative concepts in the query bendersky2008discovering ; huston2010evaluating ; hulth2003improved . We thus extract noun phrases from the citation text and omit all other terms.

2. Key concepts (QR-KW). Key concepts or keywords are single or multi-word expressions that are informative in finding the relevant context. We use the Inverse Document Frequency (IDF) sparck1972statistical measure to find the key concepts. Terms that are prevalent throughout all the text spans do not provide much information in retrieval. IDF values help capture the terms and concepts that are more specific. For key concept extraction, we filter terms by an IDF threshold that can be tuned according to the dataset (we empirically set this threshold to 1.9 and 2.2 for the TAC and CL-SciSumm datasets, respectively). We consider phrases of up to three terms.

3. Ontology (QR-Domain). Domain-specific ontologies are expert curated lexicons that contain domain-specific concepts. In this reformulation method, we use an ontology to keep only the important (domain-specific) concepts in the query. Since the TAC dataset is in the biomedical domain, we use the UMLS bodenreider2004unified thesaurus, which is a comprehensive ontology of biomedical concepts. We specifically use the SNOMED CT snomed2011systematized subset of UMLS.
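The key-concept reduction (QR-KW) can be illustrated with a minimal sketch. It assumes IDF is computed as the natural logarithm of N/df over the indexed spans and that a term is kept when its IDF is at least the threshold; the function names, the tokenizer, and the exact filtering rule are illustrative assumptions rather than the setup used in the experiments, and multi-word phrases are omitted for brevity.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(spans):
    """Map each term to log(N / df), where df counts the spans containing it."""
    n = len(spans)
    df = {}
    for span in spans:
        for term in set(tokenize(span)):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


def reduce_query(citation, idf, threshold):
    """Keep citation terms whose IDF is at least `threshold` (assumed rule).

    Terms that never occur in the indexed spans get IDF 0 and are dropped,
    since they cannot match any span.
    """
    return [t for t in tokenize(citation) if idf.get(t, 0.0) >= threshold]


spans = [
    "the gene is expressed",
    "the protein binds the receptor",
    "the receptor is active",
    "the cell divides",
]
idf = idf_table(spans)
reduced = reduce_query("The gene binds the receptor", idf, threshold=1.0)
```

Here "the" occurs in every span (IDF 0) and "receptor" in half of them (IDF about 0.69), so both fall below the illustrative threshold of 1.0, while "gene" and "binds" are retained. The numeric threshold depends on the logarithm base and on the collection, so the value used here is not comparable to those reported for the datasets.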

As explained in Section 3.1.1, the index also contains spans of consecutive sentences. Therefore, our retrieval approach can find text spans that overlap with each other. Furthermore, retrieving multiple spans from around the same location in the text signals the importance of that specific location. We apply a reranking and merging method to the retrieved spans to remove shared spans and better rank the more relevant context. We merge two overlapping spans if the retrieval score of the larger span is higher than that of the smaller span. We also evaluated other query reformulation methods such as Pseudo Relevance Feedback cao2008selecting ; however, they performed worse than the baseline and thus we do not discuss them further.
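The indexing of sentence n-grams and the merging of overlapping retrieved spans can be sketched as follows. This is a minimal illustration under stated assumptions: spans are identified by sentence indices, scores come from an arbitrary retrieval model, and when the smaller span scores higher both spans are simply kept (the text does not specify that case). All names are hypothetical.

```python
def sentence_ngrams(sentences, max_n=3):
    """Return (start, end, text) for every run of 1..max_n consecutive sentences.

    `end` is exclusive, so the span covers sentences[start:end].
    """
    spans = []
    for n in range(1, max_n + 1):
        for start in range(len(sentences) - n + 1):
            end = start + n
            spans.append((start, end, " ".join(sentences[start:end])))
    return spans


def overlaps(a, b):
    """True when two (start, end, ...) spans share at least one sentence."""
    return a[0] < b[1] and b[0] < a[1]


def merge_retrieved(spans):
    """Drop a span when a larger overlapping span has a strictly higher score.

    `spans` is a list of (start, end, score) tuples.
    """
    kept = []
    for span in sorted(spans, key=lambda s: -s[2]):  # best score first
        length = span[1] - span[0]
        absorbed = any(
            overlaps(span, k) and (k[1] - k[0]) > length and k[2] > span[2]
            for k in kept
        )
        if not absorbed:
            kept.append(span)
    return kept


index = sentence_ngrams(["A.", "B.", "C.", "D."], max_n=3)
merged = merge_retrieved([(0, 1, 0.5), (0, 2, 0.9), (3, 4, 0.4)])
```

With four sentences and up to 3 consecutive sentences per span, the index holds 4 + 3 + 2 = 9 spans. In the merging example, the single-sentence span (0, 1) is absorbed by the higher-scoring two-sentence span (0, 2), while the non-overlapping span (3, 4) is kept.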

3.1.2 Contextualization using word embeddings and domain knowledge

To explicitly account for terminology variations and paraphrasing between the citing and the cited authors, we propose another model for citation contextualization utilizing word embeddings and domain-specific knowledge.

Embeddings.

Word embeddings or distributed representations of words are mappings of words to dense vectors in a distributional space, with the goal that similar words will be located close to each other bengio2013representation . We extend the Language Modeling (LM) retrieval model ponte1998language by utilizing word embeddings to account for terminology variations. Given a citation text (query) $q$ , and a reference span (document) $d$ , the LM scores $d$ based on the probability that $d$ has generated $q$ ( $p(d|q)$ ). Using standard simplifying assumptions of term independence and uniform document prior, we have:

$$p(d|q)\propto p(q|d)=\prod_{i=1}^{n}p(q_{i}|d) \tag{1}$$

where $q_{i}\;(i=1,...,n)$ are the terms in the query. In LM with Dirichlet Smoothing zhai2004study , $p(q_{i}|d)$ is calculated using a smoothed maximum likelihood estimate:

$$p(q_{i}|d)=\frac{f(q_{i},d)+\mu\,p(q_{i}|C)}{\sum_{w\in V}f(w,d)+\mu} \tag{2}$$

where $f$ is the frequency function, $p(q_{i}|C)$ shows the background probability of term $q_{i}$ in collection $C$ , $V$ is the entire vocabulary, and $\mu$ is the Dirichlet parameter.
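As an illustration of the Dirichlet-smoothed estimate, the following minimal sketch scores a span by the log query likelihood, using the standard form in which the term count in the span plus $\mu$ times the background probability is divided by the span length plus $\mu$. The function name, the default value of `mu`, and the choice to skip query terms that never occur in the collection are assumptions made for the example.

```python
import math
from collections import Counter


def dirichlet_log_score(query, doc, collection_counts, collection_size, mu=2000.0):
    """Log query likelihood log p(q|d) with Dirichlet smoothing.

    query, doc: lists of terms.
    collection_counts: Counter of term frequencies over the whole collection.
    collection_size: total number of term occurrences in the collection.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)  # equals the sum of f(w, d) over the vocabulary
    score = 0.0
    for term in query:
        p_background = collection_counts[term] / collection_size
        if p_background == 0.0:
            continue  # assumption: ignore terms unseen in the collection
        p = (doc_counts[term] + mu * p_background) / (doc_len + mu)
        score += math.log(p)
    return score


d1 = ["gene", "expression", "level"]
d2 = ["summary", "length", "level"]
collection = Counter(d1 + d2)
query = ["gene", "expression"]
score_d1 = dirichlet_log_score(query, d1, collection, 6, mu=10.0)
score_d2 = dirichlet_log_score(query, d2, collection, 6, mu=10.0)
```

The span `d1` contains both query terms and therefore receives the higher score; `d2` still receives a finite score because the background probability keeps each term probability above zero.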

Our model extends the above formulation (Eq. 2) by using word embeddings. In particular we estimate the probability $p(q_{i}|d)$ according to the following equation:

[TABLE]

where $d_{j}$ are terms in the document $d$ , and $s$ is a function that captures the similarity between the terms and is defined as:

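$$s(q_{i},d_{j})=\begin{cases}\phi\big(e(q_{i})\cdot e(d_{j})\big), & \text{if } e(q_{i})\cdot e(d_{j})>\tau\\ 0, & \text{otherwise}\end{cases} \tag{4}$$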

where $e(q_{i})$ shows the unit vector corresponding to the embedding of word $q_{i}$ , $\tau$ is a threshold, and $\phi$ is a transformation function. Below we explain the role of parameter $\tau$ and the transformation function $\phi$ .

Word embeddings can capture the similarity of words according to some distance function. Most embedding methods represent the distance in the distributional semantics space. Therefore, the similarity between two words $q_{i}$ and $d_{j}$ can be captured using the dot product of their corresponding embeddings (i.e. $e(q_{i})\cdot e(d_{j})$ ). While high values of this product suggest syntactic and semantic relatedness between the two terms mikolov2013distributed ; pennington2014glove ; hill2015simlex , many unrelated words have non-zero dot products (an example is shown in Table 1). Considering them in the retrieval model therefore introduces noise and hurts performance. We address this issue by first applying a threshold $\tau$ below which all similarity values are squashed to zero. This ensures that only highly relevant terms contribute to the retrieval model. To identify an appropriate value for $\tau$ , we select a random set of words from the embedding model and calculate the average and standard deviation of the point-wise absolute values of similarities between the pairs of terms from these samples. We then set $\tau$ to be two standard deviations larger than the average similarity, so that only very high similarity values are considered. We also observe that, among high similarity values, the values are not discriminative enough between more and less related words. This is illustrated in Figure 2, where the similarity values decline slowly as we move away from the top similar words. We instead want only the very top similar words to contribute to the retrieval score. Therefore, we transform the similarity values according to a logit function (equation 5) to dampen the effect of less similar words (see Figure 2):

$$\phi(x)=\log\frac{x}{1-x} \tag{5}$$
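The threshold estimation and the logit transformation can be sketched as follows. The sketch assumes unit-length embedding vectors stored as plain Python lists, estimates $\tau$ as the mean plus two standard deviations of absolute similarities over randomly sampled pairs, and clips the similarity just below 1 so that identical words produce a finite value; the clipping, the sampling size, and all function names are assumptions added for the example.

```python
import math
import random
import statistics


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_tau(vectors, num_pairs=1000, seed=0):
    """tau = mean + 2 * std of |similarity| over randomly sampled word pairs."""
    rng = random.Random(seed)
    sims = []
    for _ in range(num_pairs):
        a, b = rng.sample(vectors, 2)
        sims.append(abs(dot(a, b)))
    return statistics.mean(sims) + 2 * statistics.pstdev(sims)


def logit(x):
    return math.log(x / (1.0 - x))


def similarity(e_q, e_d, tau):
    """Thresholded, logit-transformed similarity of two unit vectors."""
    x = dot(e_q, e_d)
    if x <= tau:
        return 0.0  # similarities at or below the threshold are squashed to zero
    x = min(x, 1.0 - 1e-6)  # assumption: clip so identical words stay finite
    return logit(x)


rng = random.Random(1)
vectors = [unit([rng.gauss(0, 1) for _ in range(20)]) for _ in range(50)]
tau = estimate_tau(vectors, num_pairs=200)
```

Note that the logit is negative for inputs below 0.5, so this sketch is only meaningful when the estimated threshold is at least about 0.5 or when negative contributions are handled separately.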

While any approach for training the word embeddings could be used, we use the Word2Vec le2014distributed method, which has proven effective in several word similarity tasks. We train Word2Vec on a recent dump of Wikipedia (https://dumps.wikimedia.org/enwiki/). Since the TAC dataset is in the biomedical domain, we also train embeddings on a domain-specific collection; we use the TREC Genomics collections, 2004 and 2006 hersh2009genomics , which together consist of 1.45 billion tokens.

Incorporating domain knowledge

Word embedding models learn the relationship between terms by being trained on a large corpus. They are based on the distributional hypothesis harris1954distributional , which states that similar words appear in similar contexts. While these models have been very successful in capturing semantic relatedness, recent related works have shown that domain ontologies and expert curated lexicons may contain information that is not captured by embeddings mrkvsic2016counter ; hill2015simlex ; faruqui2015retrofitting ; hence, we account for the domain knowledge in the following two ways.

• Retrofitting embeddings: In this method, we apply a post-processing step called retrofitting faruqui2015retrofitting to the word embeddings used in the model. Retrofitting optimizes an objective function that is based on relationships between words in a lexicon; it intuitively pulls closer the words that are related to each other and pushes farther apart the words that are not related to each other according to a given ontology. For the ontology, since the TAC data is in the biomedical domain, we use two domain-specific ontologies, MeSH (Medical Subject Headings) lipscomb2000medical and Protein Ontology (PRO, http://pir.georgetown.edu/pro/). For the CL-SciSumm data, since it is less domain-specific, we use the WordNet lexicon miller1995wordnet .

• Interpolating in the LM: In this method, instead of modifying the word vectors, we incorporate the domain knowledge directly in the retrieval model. We do so by interpolating the following two probability estimates:

$$p(q_{i}|d)=\lambda\,p_{1}(q_{i}|d)+(1-\lambda)\,p_{2}(q_{i}|d) \tag{6}$$

where $p_{1}$ is estimated using Eq. 3 and $p_{2}$ is a similar model that counts in the is-synonym relations ( $\operatorname{is-syn}$ ) in calculating similarities. Its formulation is exactly like Eq. 3 except it replaces the function $s$ with the following function:

[TABLE]

This function essentially counts synonyms partially, with weight $\gamma$ , in the calculation of the probability estimate $p(q_{i}|d)$ . We empirically set the value of $\gamma$ . Word embedding based methods are abbreviated as WE in the results.
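The idea of partially counting synonyms and interpolating two estimates can be sketched as follows. The sketch assumes a similarity that is 1 for identical terms, $\gamma$ for terms related by an is-synonym relation, and 0 otherwise, and a linear interpolation weighted by $\lambda$; the synonym dictionary, the function names, and these exact forms are illustrative assumptions rather than the formulation used in the experiments.

```python
def synonym_similarity(q_term, d_term, synonyms, gamma):
    """1 for identical terms, gamma for synonyms, 0 otherwise (assumed form)."""
    if q_term == d_term:
        return 1.0
    if d_term in synonyms.get(q_term, ()) or q_term in synonyms.get(d_term, ()):
        return gamma
    return 0.0


def soft_count(q_term, doc, sim):
    """Sum of similarities between a query term and every term of the span."""
    return sum(sim(q_term, d_term) for d_term in doc)


def interpolate(p1, p2, lam):
    """Linear interpolation of two probability estimates."""
    return lam * p1 + (1.0 - lam) * p2


# Hypothetical lexicon entry: "neoplasm" is listed as a synonym of "tumor".
synonyms = {"tumor": {"neoplasm"}}


def sim(q_term, d_term):
    return synonym_similarity(q_term, d_term, synonyms, gamma=0.8)


count = soft_count("tumor", ["tumor", "neoplasm", "cell"], sim)
```

In this example the span contributes a soft count of 1.8 for the query term "tumor": one exact match plus one synonym counted with weight 0.8.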

3.1.3 Supervised classification

The two previous context retrieval models are unsupervised and as such do not take advantage of the already labeled data. The CL-SciSumm dataset includes separate training and testing sets, which allows us to also investigate supervised approaches. We propose a feature-rich classifier to find the correct context for each given citation. Our approach aims to capture the semantic relatedness between a given citation text and a candidate context sentence. We specifically utilize the following features to capture this relatedness:

• Word match: counts the number of identical words between the source citation text and the candidate reference context, normalized by length.

• Fuzzy word match: same as above, with the difference that we use character n-grams to capture partial matches between the words.

• Embedding-based alignment: measures the similarity between the source and target sentences using word embedding alignment. Specifically for the two sentences $S_{1}$ and $S_{2}$ , the following function $f$ scores the sentences based on their similarity:

[TABLE]

where $s$ is a similarity function according to the equation 4. Intuitively, $f$ captures the similarity between the two sentences without only relying on lexical overlaps; it takes into account the similarity values between the terms.

• Distance between average of embeddings: measures the similarity between the two sentences by the dot product of the averages of their constituent word vectors.

• BM25 similarity score robertson2009probabilistic between the citation text and the candidate reference.

• Tf-idf and count vectorized similarities: dot product between the sparse tf-idf weighted or count weighted vectors associated with the source citation and target reference context.

• Character n-gram Tf-idf and count vectorized similarities: same as above, except that we use character 3-grams to allow partial word matches.

We train a standard linear classifier (e.g. Logistic Regression) using these features to identify the correct context for a given citation text.
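Two of the lexical features above can be illustrated with a minimal sketch. It assumes that the word match is normalized by the size of the smaller vocabulary and that the fuzzy match is the Jaccard overlap of character 3-grams; the text only says "normalized by length", so both normalizations and all function names are assumptions for the example.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text, n=3):
    """Set of character n-grams over the normalized, space-joined tokens."""
    joined = " ".join(tokens(text))
    return {joined[i:i + n] for i in range(len(joined) - n + 1)}


def word_match(citation, candidate):
    """Shared word types divided by the size of the smaller vocabulary."""
    c, r = set(tokens(citation)), set(tokens(candidate))
    if not c or not r:
        return 0.0
    return len(c & r) / min(len(c), len(r))


def fuzzy_word_match(citation, candidate, n=3):
    """Jaccard overlap of character n-grams, which rewards partial matches."""
    c, r = char_ngrams(citation, n), char_ngrams(candidate, n)
    if not c or not r:
        return 0.0
    return len(c & r) / len(c | r)


exact = word_match("We use parsers", "parsers use data")
partial = fuzzy_word_match("parsing", "parser")
```

The pair "parsing" and "parser" shares no identical word, so the word match is zero, yet the shared 3-grams "par" and "ars" give a non-zero fuzzy match. Feature values of this kind would then be fed to the linear classifier.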

3.2 Identifying discourse facets

The organization of scientific papers usually follows a standardized discourse pattern, where the authors first describe the problem or motivation, then their methods, then the results, and finally the discussion and implications. Our goal is to capture the important content from all sections of the paper; therefore, after extracting the citation contexts, we identify the associated discourse facet for each of the citation contexts retrieved in the previous step. Each citation context refers to some specific discourse facets of the reference document. To identify the correct discourse facets, we train a simple supervised model with the features listed in Table 2. Essentially, we use the citation text and the extracted reference context represented by character n-grams, the verbs in the context sentence, and the relative position of the retrieved context in the paper as features for the classifier. While the textual features (the citation and its context) were the most helpful, we empirically observed slight improvements by incorporating the verb and section position features. We train the model using an SVM classifier wang2012baselines . For the textual features, we transform them using character n-grams to allow fuzzy matching between the terms.

3.3 Generating the summary

After extracting reference contexts for the citations as described in Section 3.1, and identifying their discourse facet (Section 3.2), we generate a summary of the reference paper. Our goal is to create a summary that contains information from different discourse facets of the paper. This helps not only in diversifying the content in the summary, but also in creating a more coherent summary. To generate a summary, we first identify the most representative sentences in each group. Intuitively, we only need a few top representative sentences from each discourse facet to include in the summary. In order to find the most representative sentences, we consider sentences in each facet as nodes and their similarities as weighted edges in a graph. We then apply the “power method” erkan2004lexrank which is an algorithm similar to the PageRank random walk ranking model page1999pagerank , that finds the most central nodes in a graph. It works by iteratively updating the score of each sentence according to its centrality (total weight of incoming edges) and the centrality of its neighbors. After ranking the sentences in each group according to their centrality score, we select sentences for the final summary. We use the following methods for creating the final summary:

• Iterative. This method simply iterates over the discourse facets and selects the top representative sentence from each group until the summary length threshold is met.

• Greedy. The iterative approach could result in similar sentences ending up in the summary; this results in redundant information and potential exclusion of other important aspects of the paper from the summary. To address this potential problem, we use a heuristic that accounts for both the informativeness of candidate sentences and their novelty with respect to what is already included in the summary. Maximal Marginal Relevance carbonell1998use is one such heuristic that has these properties. It is based on the linear interpolation of the informativeness and the novelty of the sentences.
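The centrality ranking and the greedy selection can be sketched as follows. This is a minimal illustration assuming a damped, LexRank-style power iteration over a row-normalized similarity matrix and a Maximal Marginal Relevance score that interpolates normalized centrality with the maximum similarity to already selected sentences; the damping factor, the trade-off weight, and the function names are assumptions for the example, and the length threshold is replaced by a fixed sentence count `k`.

```python
def power_method(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Centrality scores for a weighted sentence graph.

    weights[i][j] is the similarity between sentences i and j.
    """
    n = len(weights)
    # Row-normalize so each row is a probability distribution over neighbours.
    rows = []
    for row in weights:
        total = sum(row)
        rows.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1 - damping) / n
            + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, scores)) < tol:
            return new
        scores = new
    return scores


def mmr_select(centrality, weights, k, trade_off=0.7):
    """Greedily pick k sentences, trading centrality against redundancy."""
    top = max(centrality)
    selected = []
    candidates = set(range(len(centrality)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((weights[i][j] for j in selected), default=0.0)
            return trade_off * centrality[i] / top - (1 - trade_off) * redundancy
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Sentences 0 and 1 are near-duplicates; sentence 2 covers different content.
weights = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
scores = power_method(weights)
summary = mmr_select(scores, weights, k=2)
```

In this toy graph the two near-duplicate sentences are the most central, so ranking by centrality alone would select both. The greedy selection instead picks sentence 0 followed by sentence 2, because sentence 1 is penalized for its similarity to the sentence already chosen.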

4 Experiments

4.1 Data

We conducted our experiments on two scientific summarization datasets. The first is the TAC 2014 scientific summarization dataset (http://tac.nist.gov/2014/BiomedSumm/). The TAC benchmark is in the biomedical domain and is publicly available upon request from NIST (National Institute of Standards and Technology). The second is the 2016 CL-SciSumm dataset jaidka2016overview , which is available in a public repository (https://github.com/WING-NUS/scisumm-corpus) and contains scientific articles from the computational linguistics domain. To our knowledge, these two are the only datasets on scientific summarization.

The TAC dataset only has one training set consisting of 20 topics. Each topic contains one reference article and a set of articles citing the reference. For each topic, 4 annotators have identified the relevant contexts and the correct discourse facet, and have written a summary. The documents are provided as plain text files and there are no predefined sentence boundaries or sections. On the other hand, the CL-SciSumm data contain separate train, development, and test sets with 30 topics in total. Similar to TAC, each topic consists of a reference article and a set of citing articles, but in the computational linguistics domain. The articles are in XML format with known sentence boundaries and sections. Another distinction is that topics in the CL-SciSumm data are annotated by one annotator at a time. The full statistics of the datasets are shown in Table 3. The distribution of the discourse facets in the two datasets is also shown in Figure 3. Since the two datasets are in different domains, the difference between the distributions of the facets is expected.

4.2 Citation contextualization

Evaluation

Evaluation of the retrieved contexts is based on the overlap of the position of the retrieved contexts and the gold standard contexts. Per the TAC guidelines (http://tac.nist.gov/2014/BiomedSumm/guidelines.html), evaluation of the TAC benchmark was performed using character offset overlaps weighted by human annotators. More formally, for a set of system retrieved contexts $S$ , and gold standard contexts $R=\{R_{1}\cup R_{2}\cup\cdots\cup R_{m}\}$ by $m$ annotators, the weighted character based precision ( $P_{char}$ ) and recall ( $R_{char}$ ) are defined as follows:

[TABLE]
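One plausible reading of the annotator-weighted character overlap can be sketched as follows. The sketch assumes that precision divides the total overlap with all annotators by $m$ times the number of system characters, and that recall divides the same overlap by the total number of gold characters over all annotators; this weighting and the function names are assumptions for illustration and may differ from the official definition.

```python
def char_set(spans):
    """Expand (start, end) character offsets, end exclusive, into positions."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_precision_recall(system_spans, annotator_spans):
    """Character-level precision and recall averaged over annotators.

    system_spans: list of (start, end) offsets returned by the system.
    annotator_spans: one list of (start, end) offsets per annotator.
    """
    system = char_set(system_spans)
    golds = [char_set(spans) for spans in annotator_spans]
    overlap = sum(len(system & gold) for gold in golds)
    precision = overlap / (len(golds) * len(system)) if system and golds else 0.0
    total_gold = sum(len(gold) for gold in golds)
    recall = overlap / total_gold if total_gold else 0.0
    return precision, recall


precision, recall = char_precision_recall(
    system_spans=[(0, 10)],
    annotator_spans=[[(0, 10)], [(5, 15)]],
)
```

In this example the system span matches the first annotator completely and half of the second annotator's span, giving a precision and recall of 0.75 each.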

The official metric for the CL-SciSum challenge was sentence level overlaps of the retrieved contexts with the gold standard. This was possible because unlike the articles in TAC which were in plaintext format, the sentence boundaries in CL-SciSum were pre-specified. We also report character level metrics for the CL-SciSum corpus; as we will see, the character level and sentence level metrics are more or less comparable.

One problem with position based evaluation metrics (character, or sentence) is that a system might retrieve a context that is in a different position than gold standard, but similar to the content of the gold standard. In such cases, the system is not rewarded at all. This is possible because authors might talk about a similar concept in different sections of the paper. To consider textual similarities of the retrieved context with the gold standard, we also compute Rouge-N scores lin2004rouge .

Comparison

To our knowledge, no review paper about the TAC challenge was released. Hence, for the TAC dataset, we compare our method against the following baselines:

• VSM. Ranking by Vector Space Model (VSM) with tf-idf weighting of the citations and the target reference contexts.

• BM25. BM25 scoring model jones2000probabilistic , which is a probabilistic framework for ranking the relevant documents based on the query terms appearing in each document, regardless of their relative proximity.

• LMD. Language modeling with Dirichlet smoothing (LMD) zhai2004study is a probabilistic framework that models the probability of documents generating the given query.

• LMD-LDA. A recently proposed extension of the LMD retrieval model using Latent Dirichlet Allocation (LDA) jian2016simple . This model considers latent topics in ranking the relevant documents.

For the CL-SciSumm data, we also compare against the top 5 best performing systems. For a brief description of these approaches, refer to Section 2.

Results.

The results on the TAC dataset are presented in Table 4. We observe that our proposed methods improve over all the baselines. The query reformulation methods QR-NP and QR-KW obtain character offset F1-scores of 23.8 and 24.1, respectively, improving on the best baseline by 7% and 8%. They also obtain higher Rouge scores. This shows that noun phrases and key words capture informative concepts in the citation that help retrieve the related reference context. Our models based on word embeddings also outperform the baselines in virtually all metrics. General domain embeddings trained on Wikipedia (WEwiki) and domain-specific embeddings trained on Genomics data (WEBio) achieve F1-scores of 23.2 and 25.5, a 4% and 14% improvement over the best baseline, respectively. The higher performance of the biomedical embeddings in comparison with general embeddings is expected because the words are captured in their correct context. An example is shown in Table 6, which lists the top similar words to the word “expression”. In the biomedical context, “expression” is defined as “the process by which genetic instructions are used to synthesize gene products”. As we can see, general domain embeddings might fail to capture this notion. Incorporating domain knowledge in the model results in further improvement, as shown in the last two rows of Table 4. The model using retrofitting (WEBio+Retrofit) improves on the best baseline by 18%, while the interpolated model (WEBio+Domain) achieves the highest improvement of 21%. These results show the effectiveness of domain knowledge in the model.

Table 5 shows the results for the CL-SciSum dataset. The first 3 rows are baselines that also are reported in TAC evaluation; in addition to those baselines, we also consider top performing state-of-the-art systems of 2016 CL-SciSum (lines 4-8) as additional baselines to compare with. For the CL-SciSum participating systems, we report the official sentence based evaluation metrics; the Rouge scores and character based metrics were not reported in the official evaluation of the task. Some of our methods are specific to the biomedical domain such as WEBio; therefore, we do not evaluate those on the CL-SciSum dataset which is in a completely different domain.

As shown in Table 5, our methods outperform the state-of-the-art on this dataset as well. The embedding-based model with Wikipedia trained embeddings (WEwiki) achieves the best results with a 13.9% F-1 score of sentence overlaps, which is slightly higher than the F-1 score of 13.4 achieved by the best previous work (Tf-idf+stem in the table) moraes2016university . Interestingly, we observe that retrofitting (WEwiki+Retrofit) does not improve over the standard embedding-based approach. This is likely due to the choice of the WordNet lexicon for retrofitting. While WordNet contains general domain terms, it does not necessarily capture relationships of words in the context of computational linguistics. In contrast to TAC, where we had a domain specific lexicon suitable for the dataset, for the CL-SciSumm data we did not find any lexicon capturing the term relationships in the computational linguistics domain. We believe that retrofitting with such a lexicon could result in further improvements. While query reformulation-based approaches improve over most of the baselines, their performance falls below the best baseline system. On the other hand, our supervised method also improves on the best baseline, achieving the highest overall precision (11.3%), Rouge-2 (17.5%), and Rouge-3 (15.0%) scores. (We do not report results of the supervised model on the TAC dataset because the TAC data do not have separate train and test sets.) It is encouraging that our embedding-based models (method names starting with “WE” in Table 5), which are unsupervised, achieve the best results on this task and surpass the performance of the feature-rich supervised models. Table 7 shows the importance of each feature for our supervised method (explained in § 3.1.3). While the most important features are n-gram and character n-gram based tf-idf similarity, embedding based alignment and distance of average embeddings are also important in finding the correct context.

As evident from Tables 4 and 5, the absolute system performances are not high, which further shows that this task is challenging. Since the TAC data are annotated by 4 people, we investigate the difficulty of this task for the human annotators. To do so, we calculate the agreement of the annotators with respect to the relevant context for the citations. Table 8 shows the number of citations grouped by the number of annotators that agree at least partially on the correct context. As illustrated, there are only 68 citations out of 313 for which all 4 annotators partially agree on the context span. This shows that the contextualization task is not trivial even for the human expert annotators.

Parameters

Our interpolated model of embeddings and domain knowledge (WEBio+Domain) has two main parameters, $\gamma$ and $\lambda$ . Figure 4 shows the sensitivity of our model to different parameters. We observe that the best performance is achieved when $\gamma=0.8$ and $\lambda=0.5$ . Our models retrieve a ranked list of contexts for the citations; we choose a cut-off point for returning the final results. Figure 4(c) shows the effect of the cut-off point on one of our models (the cut-off point has a similar effect on all the models). We observe that the optimal cut-off point for the best sentence F1-score is 3.

4.3 Identifying discourse facets

Evaluation

The official metric for evaluation of discourse facet identification is the Precision, Recall and F1-scores of the discourse facets, conditioned on the correctness of the retrieved reference context jaidka2016overview . Therefore, we report the results for the CL-SciSumm data based on this metric. For the TAC dataset, the official metric is the classification accuracy weighted by the annotator agreements (http://tac.nist.gov/2014/BiomedSumm/guidelines/). The accuracy for a system returned discourse facet is the number of annotators agreeing with that discourse facet divided by the total number of annotators.

Results

Table 9 shows the results of our methods compared with the top performing official submitted runs to the CL-SciSum 2016. We do not report the results of low performing systems. The classification algorithm for identifying the discourse facets is the method described in Section 3.2 across all our methods. However, since only the correct retrieved contexts are rewarded, the performance of each model differs based on the accuracy of retrieving the correct contexts. We observe that most of our methods (except for the QR-NP) improve over all the baselines in terms of all metrics. We obtain substantial improvements especially in terms of precision. The best method for identifying the discourse facets is the supervised method (indicated with “supervised” in the Table) which obtains 36.1% F-1 score, improving the best baseline (“Jaccard Focused Method”) by 16%. Embedding methods also perform well by obtaining F-1 scores of 33.1% for the Wikipedia embeddings, and 34.8% for the retrofitted embeddings. These results further show the effectiveness of our contextualization methods along with the proposed classifier for identifying the facets.

We also demonstrate the intrinsic performance of our classifier for identifying the discourse facets in Table 10. As illustrated, the weighted average F1 performance over all discourse facets is 0.73. One challenge in identifying the discourse facets is the unbalanced dataset and the limited number of training examples for some specific facets. As also reflected in the table, we observe that for categories with a smaller number of instances, the performance is generally lower. We therefore believe that having more training samples in the rare categories could further increase the performance.

Table 11 shows the results of facet identification on the TAC dataset as well as the effect of learning algorithms. Since the TAC dataset has 4 annotators and the official metric is the weighted accuracy score, we also calculate the oracle score by always predicting what the majority of the annotators agree on. The oracle achieves a score of 0.67, suggesting that identifying discourse facets is not trivial for humans. We can see that the SVM classifier achieves the highest results, with 81% accuracy relative to the oracle. For the CL-SciSumm dataset, there is only one annotator per discourse facet and therefore the weighted accuracy metric reduces to simple accuracy.
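The weighted accuracy and the majority-vote oracle can be made concrete with a short sketch. The function names and the facet labels in the example are illustrative assumptions; ties in the majority vote are broken by first occurrence, which is also an assumption.

```python
from collections import Counter


def weighted_accuracy(predicted, annotations):
    """Fraction of annotators whose facet label equals the predicted one."""
    return sum(1 for label in annotations if label == predicted) / len(annotations)


def oracle_prediction(annotations):
    """Majority label among the annotators (ties broken by first occurrence)."""
    return Counter(annotations).most_common(1)[0][0]


def mean_weighted_accuracy(predictions, all_annotations):
    """Average weighted accuracy over all citations."""
    scores = [weighted_accuracy(p, a) for p, a in zip(predictions, all_annotations)]
    return sum(scores) / len(scores)


# Two citations, each labeled by 4 annotators (labels are illustrative).
all_annotations = [
    ["method", "method", "results", "method"],
    ["method", "results", "results", "discussion"],
]
oracle = [oracle_prediction(a) for a in all_annotations]
oracle_score = mean_weighted_accuracy(oracle, all_annotations)
```

Even the oracle only reaches (0.75 + 0.5) / 2 = 0.625 in this toy example, because the annotators disagree with each other; this is why the oracle score on real data stays well below 1.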

4.4 Summarization

We evaluate our summarization approach against the gold standard summaries written by human annotators. We set the summary length threshold to the average summary length in words in each dataset (see Table 3). Table 12 shows the results for the summarization task. The first lines show the baselines, which are existing summarization approaches including the SumBasic vanderwende2007beyond algorithm and the original citation-based summarization approach qazvinian2008scientific . The next four lines are the top state-of-the-art systems on the CL-SciSumm dataset. For the CL-SciSumm systems, the official reported results only included Rouge-2 and Rouge-SU4 scores. As illustrated in the table, virtually all our methods improve over the state-of-the-art, showing the effectiveness of our proposed summarization approach. Our best method (QR-NP-greedy) is based on the noun phrases query reformulation using the greedy strategy of sentence selection. It achieves a Rouge-2 score of 30.2, which improves over the best baseline by 37.4%. In general, we can see that the greedy sentence selection strategy works better than the iterative approach. This is because the greedy strategy takes into account both the informativeness and the redundancy of the selected sentences.

Table 13 shows the results of summarization on the TAC dataset. The reported approaches all use the greedy sentence selection strategy, as it consistently outperforms the iterative approach. In general, while all our approaches outperform the baseline, query reformulation based approaches achieve the highest Rouge scores; the query reformulation method using noun phrases (QR-NP) achieves Rouge-2 and Rouge-3 scores of 15.8 and 6.9, respectively, which are the highest scores. The interpolated word embedding based model (WEBio+Domain) achieves the highest Rouge-SU4 score (20.7). Comparing Tables 12 and 13, we notice that the scores for the TAC dataset are lower than those for CL-SciSumm. This is due to the length of the generated summaries. As shown in Table 3, the average human summary length in the TAC data is almost 100 words more than in the CL-SciSumm summaries. An interesting observation in these two tables is the relatively poor performance of the citation-based summarization baseline (CLexRank), which only uses citation texts, in comparison with our methods, which also take advantage of the citation context and the discourse structure of the articles. This observation further confirms our initial hypothesis that relying only on the citation texts could result in summaries that do not accurately reflect the content of the original paper, and that adding citation contexts can help produce better summaries.

To better analyze the effect of identifying discourse facets on the overall quality of the summary, we compare the Rouge scores of the summary generated by our approach with and without this step. Table 14 shows the overall summarization results based on our QR-NP approach when we only use contextualized citations compared with when we use faceted contextualized citations. We observe that grouping citation contexts by their corresponding discourse facet has a positive effect on the quality of the summary on both datasets (17% and 55% improvements over TAC and CL-SciSum datasets in terms of Rouge-2, respectively). This is because identifying facets and grouping the contextualized citations by facets, results in a summary that captures the content from all sections of the paper. We observe similar trends for other variants of our approaches; for brevity we only show the results for QR-NP as an illustrative analysis on the effect of identifying discourse facets on the quality of the generated summary.

Finally, an example of the generated summaries by our system (QR-NP-greedy) that uses citation contexts and discourse facets is illustrated in Figure 5. We observe that compared with the human summary, the summary generated by our system can capture the significant points of the paper.

5 Discussion

Citations are a significant part of scientific papers, and analysis of citation texts can provide valuable information for various scholarly applications. Our work provides new approaches for contextualizing citations, which is a sub-task for enriching citation texts and thus can benefit various bibliometric enhanced NLP applications such as information extraction, information retrieval, article recommendation, and article summarization. Our work provides a comprehensive new framework for summarizing scientific papers that helps generate better scientific summaries.

We note that our evaluation was based on the Rouge automatic summarization evaluation framework. Automatic evaluation metrics have their own limitations and cannot fully characterize the effectiveness of the systems. Manual or semi-manual evaluation of summarization (e.g. through Pyramid framework) are alternative evaluation approaches that can provide additional insights into the performance of the systems. Yet, due to expense and reproduction issues, most of the standard evaluation benchmarks including TAC and CL-SciSum have been evaluated through Rouge. As it is standard in the field and to be able to compare our results with the related work, we used the Rouge framework for evaluation. We also note that our focus has been on the content quality of the summaries and other criteria such as coherence and linguistic cohesion have not been the focus of our approach. Future work can investigate approaches for improving coherence and linguistic properties of the generated summaries.

6 Conclusions

We presented a unified framework for scientific summarization; our framework consists of three main parts: finding the context for the citations in the reference paper, identifying the discourse facet of each citation context, and generating the summary from the faceted citation contexts. We utilized query reformulation methods, word embeddings, and domain knowledge in our methods to capture the terminology variations between the citing and cited authors. We furthermore took advantage of the scientific discourse structure of the articles. We demonstrated the effectiveness of our approach on two scientific summarization benchmarks, each in a different domain. We improved over the state-of-the-art by large margins in most of the tasks. While the results are encouraging, the absolute values of some metrics, especially in the contextualization task, suggest that this problem is worth further exploration. Contextualizing citations is a new task; it not only helps improve scientific summarization, but can also benefit other bibliometric enhanced end-to-end applications such as keyword extraction, information retrieval, and article recommendation.

Bibliography

(1) Abu-Jbara, A., Ezra, J., Radev, D.R.: Purpose and polarity of citation: Towards NLP-based bibliometrics. In: NAACL-HLT, pp. 596–606 (2013)
(2) Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 500–509. Association for Computational Linguistics (2011)
(3) Abu-Jbara, A., Radev, D.: Reference scope identification in citing sentences. In: NAACL-HLT, pp. 80–90. ACL (2012)
(4) Atanassova, I., Bertin, M., Larivière, V., Bawden, D.: On the composition of scientific abstracts. Journal of Documentation 72(4) (2016)
(5) Bendersky, M., Croft, W.B.: Discovering key concepts in verbose queries. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 491–498. ACM (2008)
(6) Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
(7) Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. The Journal of Machine Learning Research 3, 1137–1155 (2003)
(8) Berg-Kirkpatrick, T., Gillick, D., Klein, D.: Jointly learning to extract and compress. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 481–490. Association for Computational Linguistics (2011)