SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation
Hong Chen, Hiroya Takamura, Hideki Nakayama

TL;DR
This paper introduces SciXGen, a large-scale dataset for context-aware scientific text generation, and benchmarks its effectiveness in generating descriptions and paragraphs using state-of-the-art models.
Contribution
It presents SciXGen, a novel dataset of 205,304 annotated scientific papers, to advance context-aware text generation in the scientific domain.
Findings
SciXGen improves scientific text generation quality.
State-of-the-art models perform well on the dataset.
The dataset is publicly available for research.
Abstract
Generating text in scientific papers requires not only capturing the content of the given input but also, frequently, acquiring external information called context. We push scientific text generation forward by proposing a new task, context-aware text generation in the scientific domain, which aims to exploit the contribution of context to the generated text. To this end, we present a novel and challenging large-scale Scientific Paper Dataset for Context-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. Using state-of-the-art models, we comprehensively benchmark the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
