TL;DR
This paper introduces CiteWorth, a large, high-quality dataset for cite-worthiness detection in scientific texts, and demonstrates that context-aware models significantly outperform sentence-only models, improving scientific document understanding.
Contribution
The paper presents CiteWorth, a novel large dataset for cite-worthiness detection, and develops a context-aware model that outperforms previous sentence-based approaches.
Findings
The CiteWorth dataset is high-quality, challenging, and suitable for studying domain adaptation.
Contextualized models outperform sentence-only models by 5 F1 points.
Fine-tuning language models on cite-worthiness improves downstream tasks.
Abstract
Scientific document understanding is challenging as the data is highly domain specific and diverse. However, datasets for tasks with scientific text require expensive manual annotation and tend to be small and limited to only one or a few fields. At the same time, scientific documents contain many potential training signals, such as citations, which can be used to build large labelled datasets. Given this, we present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source. To accomplish this, we introduce CiteWorth, a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection built from a massive corpus of extracted plain-text scientific documents. We show that CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation. Our best…
Taxonomy
Methods: Multi-Head Attention · Linear Layer · AdamW · WordPiece · Layer Normalization · Attention Dropout · Softmax
