Inline Citation Classification using Peripheral Context and Time-evolving Augmentation
Priyanshi Gupta, Yash Kumar Atri, Apurva Nagvenkar, Sourish Dasgupta,, Tanmoy Chakraborty

TL;DR
This paper introduces a new dataset and a Transformer-based model for inline citation classification that leverages peripheral context and domain knowledge, achieving state-of-the-art results.
Contribution
The study presents 3Cext, a novel dataset with discourse information, and PeriCite, a deep neural network that fuses peripheral sentences and domain knowledge for improved citation classification.
Findings
PeriCite outperforms baselines with +0.09 F1 on 3Cext
The dataset includes discourse and domain context for citation analysis
Extensive ablations validate the effectiveness of context and knowledge fusion
Abstract
Citation plays a pivotal role in determining the associations among research articles. It portrays essential information in indicative, supportive, or contrastive studies. The task of inline citation classification aids in extrapolating these relationships; However, existing studies are still immature and demand further scrutiny. Current datasets and methods used for inline citation classification only use citation-marked sentences constraining the model to turn a blind eye to domain knowledge and neighboring contextual sentences. In this paper, we propose a new dataset, named 3Cext, which along with the cited sentences, provides discourse information using the vicinal sentences to analyze the contrasting and entailing relationships as well as domain information. We propose PeriCite, a Transformer-based deep neural network that fuses peripheral sentences and domain knowledge. Our model…
| First Sentence | Cited Sentence | Second Sentence |
|---|---|---|
| [CITE] describe a hybrid recommender system that exploits ontologies to increase the accuracy of the profiling process and hence the usefulness of the recommendations. | #AUTHOR_TAG use a different strategy by representing user profiles as bags-of-words and weighing each term according to the user interests derived from a domain ontology | Razmerita et al. [CITE] describe OntobUM, an ontology-based recommender that integrates three ontologies: i) the user ontology, which structures the characteristics of users and their relationships, ii) the domain ontology, which defines the domain |
| Content-based recommender systems [CITE] rely on pre-existing domain knowledge to suggest items more similar to the ones that the user seems to like. They usually generate user models that describe user interests according to features [CITE]. | This API supports a number of applications, including Smart Book Recommender, Smart Topic Miner [CITE], the Technology-Topic Framework #AUTHOR_TAG, a system that forecasts the propagation of technologies across research communities, and the Pragmatic Ontology Evolution Framework [CITE] | EOF |
| Dataset | Classes | |||||
|---|---|---|---|---|---|---|
| BACKGROUND | COM-CONT | EXTENSN | FUTR | MOTIVN | USES | |
| 3Cext | 1318 | 380 | 294 | 221 | 137 | 50 |
| Baseline System | Dataset | |||||
|---|---|---|---|---|---|---|
| 3C | 3Cext | |||||
| Precision | Recall | F1 Score | Precision | Recall | F1 Score | |
| Multi-layer Perceptron | 0.30 | 0.38 | 0.30 | 0.39 | 0.32 | 0.39 |
| LSTM-Attention-Scaffold | 0.27 | 0.48 | 0.46 | 0.30 | 0.55 | 0.39 |
| Transformer | 0.38 | 0.37 | 0.34 | 0.36 | 0.38 | 0.37 |
| DistilBERT | 0.43 | 0.48 | 0.37 | 0.46 | 0.55 | 0.40 |
| BART | 0.41 | 0.46 | 0.34 | 0.47 | 0.52 | 0.41 |
| T5 | 0.41 | 0.43 | 0.35 | 0.43 | 0.48 | 0.38 |
| SciBERT | 0.45 | 0.51 | 0.46 | 0.52 | 0.53 | 0.51 |
| PeriCite | ||||||
| PeriCite w/ SF | 0.38 | 0.37 | 0.34 | 0.36 | 0.38 | 0.37 |
| PeriCite w/ CTA | 0.34 | 0.36 | 0.32 | 0.38 | 0.41 | 0.37 |
| PeriCite w/ TEA | 0.36 | 0.43 | 0.41 | 0.38 | 0.42 | 0.39 |
| PeriCite w/ TEA, SA, CTA | 0.46 | 0.44 | 0.42 | 0.60 | 0.63 | 0.60 |
| Model | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|
| PeriCite | BACKGROUND | 0.67 | 0.83 | 0.74 |
| COMPARE CONTRAST | 0.49 | 0.28 | 0.36 | |
| EXTENSION | 0.30 | 0.09 | 0.14 | |
| FUTURE | 0.33 | 0.17 | 0.22 | |
| MOTIVATION | 0.55 | 0.31 | 0.40 | |
| USES | 0.61 | 0.61 | 0.62 |
| Main Text | True | Pred |
|---|---|---|
| What has been termed episodic foresight (#AUTHOR_TAG, 2010), along with autobiographic memory and theory of mind, also makes up much of our mind wandering (Spreng and Grady, 2009), as we preview some future activity or consider possible future options in order to select appropriate action | 1 | 2 |
| We assess the similarity of two semantic vectors using the cosine similarity #AUTHOR_TAG, since this measure relies on the orientation but not the magnitude of the topic weights in the vector space, allowing us to compare editorial products associated with a different number of chapters | 0 | 5 |
| Following the process of reflection presented by #AUTHOR_TAG (1996), in the new version, the first word of 4 questions was added as a visual prompt | 0 | 5 |
| Some more recent models, though, have also included domain experts to define the learning content of the educational game (#AUTHOR_TAG et al., 2017) | 4 | 1 |
| Predicted | |||||||
|---|---|---|---|---|---|---|---|
| Background | Com-Cast | Extensions | Future | Motivation | Uses | ||
| Actual | Background | 280/320 | 13/0 | 5/0 | 1/0 | 9/0 | 27/0 |
| Com-Cast | 46/73 | 21/0 | 1/0 | 0/0 | 3/0 | 3/1 | |
| Extensions | 25/34 | 0/0 | 3/0 | 1/0 | 2/0 | 3/0 | |
| Future | 9/12 | 1/0 | 0/0 | 2/0 | 0/0 | 0/0 | |
| Motivation | 28/55 | 3/0 | 1/0 | 1/0 | 17/0 | 5/0 | |
| Uses | 30/79 | 5/0 | 0/0 | 1/0 | 0/0 | 59/16 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
11institutetext: 1IIIT-Delhi, India, 2Crimson AI, India, 3IIT Delhi, India
{priyanshig,yashk}@iiitd.ac.in, {apurva.nagvenkar,sourish.dasgupta}@crimsoni.ai,
Inline Citation Classification using Peripheral Context and Time-evolving Augmentation
Priyanshi Gupta1
Yash Kumar Atri1
Apurva Nagvenkar2
Sourish Dasgupta2
Tanmoy Chakraborty3
Abstract
Citation plays a pivotal role in determining the associations among research articles. It portrays essential information in indicative, supportive, or contrastive studies. The task of inline citation classification aids in extrapolating these relationships; However, existing studies are still immature and demand further scrutiny. Current datasets and methods used for inline citation classification only use citation-marked sentences constraining the model to turn a blind eye to domain knowledge and neighboring contextual sentences. In this paper, we propose a new dataset, named 3Cext, which along with the cited sentences, provides discourse information using the vicinal sentences to analyze the contrasting and entailing relationships as well as domain information. We propose PeriCite, a Transformer-based deep neural network that fuses peripheral sentences and domain knowledge. Our model achieves the state-of-the-art on the 3Cext dataset by F1 against the best baseline. We conduct extensive ablations to analyze the efficacy of the proposed dataset and model fusion methods.
Keywords:
citation classification bibliometrics transformer.
1 Introduction
For the past several decades, there has been an interest in citation analysis for research evaluation. Researchers have emphasized the necessity for new methodologies that take into account various components of citing sentences. A well-known qualitative technique for assessing the scientific influence is to analyze the sentence in which the research article is mentioned to ascertain the purpose behind the citation. The context of the citation, or the text in which the cited document is mentioned, has proven to be an effective indicator of the citation’s intent [25]. Measuring the scientific impact of research articles requires a fundamental understanding of citation intent. A great way to gauge the significance of a scientific publication is to determine why citations are made in one’s work and how significant they are.
Previous methods for citation context categorization used a range of annotation techniques with low-to-high granularity. Comparing the earlier systems is extremely difficult due to the absence of standardized methodologies and annotation schemes. The 3C shared task [12, 13] used a piece of the Academic Citation Typing (ACT) dataset to categorize the reference anchor into ‘function’ or ‘purpose’ by looking at the citing sentence or the text that contains the citation [19]. Only quantitative elements are considered in traditional citation analysis based solely on the citation count. One of the biggest obstacles to citation context analysis for citation identification is that there is no multidisciplinary dataset and that there isn’t any medium to fine-grained schemes that adequately represent the function and its influence [8]. To address this challenge, Kunnath et al. [12] provided a unified task, called the 3C Shared Task, to compare several citation classification approaches on the same dataset to address the shortcomings of citation context categorization. The main distinction in the second iteration of this task [13] was that the subtasks contained full-text datasets. However, even with the full text, the metadata associated with the citation sentence was not adequate to understand the reasoning for the citation.
To alleviate the above limitations, we propose a new dataset, named 3Cext, and a new model, named PeriCite that combines the advantages of augmentation and peripheral context. Experiments show that the cited sentences heavily rely on the peripheral context to strengthen an argument by contrasting or entailing information. Our main contributions are as follows
We extend the 3C dataset [13] – 3Cext, which, along with the cited sentence, adds more discourse information by providing contrasting and entailing information using the peripheral sentences. 2. 2.
We propose a novel model, PeriCite, which uses spatial fusion and cross-text attention to attend to contextual information for the peripheral sentences and time-evolving augmentation to counter class imbalance during the training time. 3. 3.
We also compare our proposed model against various baselines and show the efficacy of the module along with ablation studies and error analysis.
2 Related Work
Citations are important for persuasion since they provide a source of support for the assertions made by authors. Understanding whether the writers agree or disagree with the assertions made in the cited publication is crucial because not all citations are used with a similar purpose. In order to classify citations according to their context, a sizable corpus of research has previously examined the language used in scientific discourse. Several frameworks have been devised to categorize the intent of citations [16]. Many strategies were used in the early efforts for automatic citation intent categorization; they included rule-based systems [7, 18], machine learning techniques based on language patterns [9], and manually-constructed features from the citation context. Teufel et al. [25] introduced how to annotate citations in scholarly articles for 12 classes and used machine learning techniques to replicate annotation. These classes were split into four top-level groups, namely neural class, citations that expressly address weaknesses, citations that contrast or compare, and citations that concur with, use, or are compatible with the citing work. Abu-Jbara et al. [1] utilized a linear SVM and lexical, structural, and syntactic characteristics for categorization. Additionally, feature-based techniques [5, 4] for locating quoted spans in the mentioned publications have been studied. Improvements were shown by a joint prediction of cited spans and citation function using a CNN-based model [24].
Most of these initiatives offer far too fine-grained citation categories, some of which are infrequently used in articles. They, therefore, serve little purpose in automated analyses of scientific articles. Jurgens et al. [10] developed a six-category technique to incorporate earlier research and suggested a more precise categorization scheme expanding all previous feature-based work on citation intent classification. The authors also added six categories and samples from computational linguistics studies in addition to the three original features—pattern-based, topic-based, and prototype argument-based. They also used structural topology, lexical semantics, grammatical, field positions and values, and usage characteristics.
All of the methods listed above, classified data using hand-engineered features. Cohan et al. [3] proposed a neural multi-task learning technique for classifying citation intent using non-contextualized (GloVe [17]) and contextualized embeddings (ELMo [23], Bidirectional LSTM, and attention method). The authors used two auxiliary tasks to support the primary classification task in order to accomplish multi-task learning. Their recent research [3] included only three citation categories and instances from the Computer Science and Medical domains to make up their new dataset (SciCite). Beltagy et al. [2] released SciBERT, a model pre-trained over million papers from Semantic Scholar. To support the study in this area, a recent analysis by [11] evaluated 60 research articles on this topic, the difficulties the researchers had while conducting their work, and the knowledge gaps that still need to be addressed.
3 Methodology
In this section, we discuss our proposed model, PeriCite. It comprises two stacked Transformers, each with four blocks. It uses Cross-Text Attention to capture the discourse between the cited and peripheral sentences. PeriCite also houses Time-evolving Augmentation to synthetically generate data as per label loss and Spatial Fusion to fuse the final representations of stacked Transformers. Figure 1 shows the schematic diagram of PeriCite.
3.1 Cross-Text Attention
Attention formulation over a single input text may not provide adequate information to the model. However, when fused with the peripheral context attention, the model can learn important excerpts relative to the label. To fuse peripheral context attention to the main input, we propose Cross-Text Attention (CTA). It computes pairwise weights between the main text and a peripheral context.
Given the self-attention as
[TABLE]
initially, queries , keys , and values are generated for the main input text with and as their dimensions, respectively. Next, to compute the CTA score, pairwise weights between the main input and peripheral context are computed using
[TABLE]
[TABLE]
[TABLE]
Here denotes the contextual representation of the main input text, denotes the context of the peripheral text, and , , and with and represents queries, keys, and values based on the main and peripheral text, respectively. We then perform a linear projection of attention heads to capture language comprehension. Finally, the computed attention weights are passed through a feed-forward layer.
3.2 Spatial Fusion
Since the fusion module combines the generated attention vectors from both peripheral encoders, it is crucial to determine which vector is more significant, how it contributes to the main context and interacts with each other. Keeping this intuition in mind, we utilise a fusion strategy based on Spatial Fusion (SF). SF was first introduced by Li et al. [15] to fuse multiple images to decipher deep features. We extend it in our setting to fuse texts together to form deep contextual features.
In spatial fusion as suggested in [27], global features are modeled by a self-attention layer, which is obtained from a convolutional and a dimensionality reduction layer. The combination of these two captures the local as well as the global relations between the feature set. Further, we pass the obtained features to a range of convolutional blocks to enhance the contexts during fusion. Finally, an ordered weighting average map is created to merge features with the source text.
The multi-scale feature matrix is denoted by , where is the list of tensors . The fusion mechanism takes two inputs – one from the cited sentence as , and the other from the peripheral sentence as . The spatial attention [15] is created by the fusion and fed to the final decoding layer.
The Ordered Weighing maps are computed using the L1 normalization and softmax over and resulting in weight matrix and , respectively. The final Ordered Weighing maps are computed using Eq. 1 as follows,
[TABLE]
Here L1 norm is computed for both and with ranging in set. The position in the feature set and are indicated by () with a fixed vector size . The also outputs a vector of size V.
The feature vectors and over the weight matrix and are further enhanced by weighing them using using Eq. 2,
[TABLE]
Finally, the fused vector is computed by projecting it against the enhanced feature set using Eq. 3
[TABLE]
3.3 Time Evolving Augmentation
Data augmentation is useful to generate synthetic data and to balance a dataset. However, most of the data augmentation techniques generate synthetic data based on random transformations of the minor class. As shown in Table 2, our proposed dataset 3Cext also shows heavy class imbalance with the majority class showing three times the number of samples than the second major class. Two of the most popular ways to handle the class imbalance is either to make all classes equally representative in the training set or augment the minority class’s samples to match the majority class. However, for the task of inline citation classification, both of these methods lead to more degraded performance pertaining to structural complexity and information spread. To tackle these limitations, we propose Time-evolving Augmentation (TEA).
At every time step , TEA computes the label representation in each mini-batch as . For a loss computed at time step , the model computes the loss per label for a given mini batch and formulates a loss to label relationship as . Given the distribution , TEA synthetically generates training samples for the minority class having the highest loss in a given mini-batch. The data samples are augmented using the GPT2 language model [20]. The representation is independent of the global representation of a number of samples per class and only takes into account the representation of the given mini-batch. This method helps the model keep the loss in check for each label at every step. The loss of representation per label is a guiding factor for the TEA to evolve at every time step, helping to model to learn equal representation during the training phase.
4 Experiments
4.1 Dataset
In this section, we discuss our proposed 3Cext dataset in detail. Kunnath et al. [12] introduced the ACT dataset, with annotations for citations annotated by authors. The cited label was masked with “#AUTHOR TAG” denoting the position of the cited object. Additionally, the 3C dataset contained full text and the label denoting the class of a particle citation (c.f. Table 2).
In our work, we extend the 3C dataset to house more discourse information to explain better why a citation is present in a sentence. Our intuition is that the cited sentences mostly either entail or contrast the adjoining sentences. To capture the peripheral sentences, we extract the full-text files corresponding to the COREIDs (unique paper ID) in our dataset to follow through on this discovery. For a given document, we map the location of the main cited sentence and find the prefixed and the suffixed sentence. We use heuristic methods like regex, Levenshtein distance, and hard rules like full-stop identification and author name identification to mark three sentences. In our dataset, the first sentence indicates being the prefix sentence before the citation and second, the suffix sentence after the citation. The six categories of the classes are distributed in labels between 0 and 5, as 0 - BACKGROUND, 1 - COMPARES CONTRASTS, 2 - EXTENSION, 3 - FUTURE, 4 - MOTIVATION, 5 - USES (as suggested in [12]). Table 1 illustrates a sample instance from 3Cext, and Table 2 represents the number of instances per class.
4.2 Implementation Details
4.2.1 Dataset Setting:
On the dataset part, we first preprocess the data by removing stopwords, lowering cases, and removing all special characters. We clip the sentences at 256 token lengths for train and test instances. We use instances as a train set and as a test set.
5 Baselines
We discuss the baseline systems in detail. For the language model (LM)-based baselines, we fine-tune LM with the training samples of 3C and 3Cext till the convergence.
Multilayer Perceptron: We use three stacked dense layers [6] with softmax activation. We use Glove embeddings as input representation with cross-entropy as loss. 2. 2.
BiLSTM with Attention and Scaffolding [3]: This baseline uses BiLSTM with Attention with Glove as input embedding and Elmo as contextual representation. A 20-node MLP is used for the scaffolding task. We preserve all the original hyperparameters for the baseline. 3. 3.
Transformer: We use the vanilla Transformer [26] architecture to run as baseline. We use 4 stacked layers each in Encoder and Decoder with a max sequence length of 32, with softmax as the activation function and cross-entropy as loss. 4. 4.
DistilBERT [22]: It follows the same architecture of BERT but reduces the number of parameters by making use of knowledge distillation during pretraining. We use the huggingface ported model for the baseline. 5. 5.
SciBERT: SciBERT [2] uses the standard BERT architecture with pretraining performed on the scientific documents. The hyperparameters are similar to the Transformer baseline. 6. 6.
BART: Similar to SciBERT, BART [14] uses a bidirectional encoder along with an autoregressive decoder. It is pre-trained over the Books and Wikipedia data. We use similar hyperparameters to the SciBERT baseline. 7. 7.
T5: It is a pretrained Transformer based encoder-decoder language model [21]. It is pretrained as a text-to-text Transformer over various supervised and unsupervised downstream tasks.
6 Analysis
We perform ablation studies on our proposed PeriCite model to showcase the efficacy of each module. We show that the peripheral context alone can significantly improve the model’s performance by providing contextual information. The addition of TEA pushes the performance for each class, concluding that controlled synthetic generation of training data improves the system’s overall performance. Table 3 shows that our model attains an improvement of +0.09 F1 points with CTA, SA, and TEA against the best baseline. The improvements are seen in every module. With the introduction of CTA, our model attains an improvement of +0.02 Recall against the base Transformer network. TAdding TEA shows an improvement of +0.02 F1 and +0.04 Recall against the Transformer.
Table 4 shows the class-wise performance of PeriCite. It shows that our model is able to capture contextual information for all classes. When compared to the best baseline’s (SciBERT) confusion matrix in Table 6, we see that the baseline leans heavily towards the majority class and predicts [math] for almost all other classes. However, our model was able to predict all classes uniformly. We also analyze the model’s prediction errors in Table 5. For the sentence in the second row, the model might be distracted by the phrase “we assess the similarity” giving it the impression that it is a “use” category. The third row is also likely to be distracted by the phrase “Following the process of reflection”. The mislabelling is probably due to very low number of training samples for these classes. Providing more contextual information and large number of qualitative training samples can help improve the performance of the model.
7 Conclusion
In this paper, we proposed a new dataset 3Cext, where we extended the 3C dataset by introducing peripheral contextual information to analyze the contrasting and entailing information. We also introduced a novel model, PeriCite that uses cross-text attention to attend to the contextual information present in citation input and the peripheral sentences. We also introduced time-evolving augmentation to generate synthetic data for the minority classes during each time step and spatial fusion to attend to the critical information in the input space. Our proposed model achieves a new state-of-the-art on 3Cext by +0.09 F1 score against the best baseline. We also conducted extensive ablations to analyze the efficacy of the proposed dataset and model fusion methods. For future works, an exciting line of work can be to utilize the discourse information of the sections to provide more context to the inline citations. The contrasting or entailing information in the neighbouring sentence is crucial in understanding the reasoning’s of citation intent. Additional tasks like baseline recommendations, scientific paper recommendations, etc., can be greatly improved with the performance improvement over this task.
8 Acknowledgment
The research reported in this paper is funded by Crimson AI Pvt. Ltd.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Abu-Jbara, A., Ezra, J., Radev, D.: Purpose and polarity of citation: Towards nlp-based bibliometrics. In: NAACL. pp. 596–606 (2013)
- 2[2] Beltagy, I., Lo, K., Cohan, A.: Scibert: A pretrained language model for scientific text. ar Xiv preprint ar Xiv:1903.10676 (2019)
- 3[3] Cohan, A., Ammar, W., van Zuylen, M., Cady, F.: Structural scaffolds for citation intent classification in scientific publications. In: NAACL. pp. 3586–3596. ACL, Minneapolis, Minnesota (Jun 2019)
- 4[4] Cohan, A., Goharian, N.: Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In: ACM SIGIR. pp. 1133–1136 (2017)
- 5[5] Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: NAACL. pp. 1042–1048 (2015)
- 6[6] Gardner, M.W., Dorling, S.: Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment 32 (14-15), 2627–2636 (1998)
- 7[7] Garzone, M., Mercer, R.E.: Towards an automated citation classifier. In: Conference of the canadian society for computational studies of intelligence. pp. 337–346. Springer (2000)
- 8[8] Hernández-Alvarez, M., Gómez, J.M.: Citation impact categorization: for scientific literature. In: 2015 IEEE ICCSE 18th International Conference on Computational Science and Engineering. pp. 307–313. IEEE (2015)
