CS-PaperSum: A Large-Scale Dataset of AI-Generated Summaries for Scientific Papers

Javin Liu; Aryan Vats; Zihao He

arXiv:2502.20582·cs.IR·March 3, 2025

TL;DR

CS-PaperSum is a large dataset of over 91,000 computer science papers with AI-generated structured summaries, facilitating automated literature analysis and research trend identification.

Contribution

We created a large-scale dataset with AI-generated summaries for scientific papers, enabling advanced analysis of research trends and scientific discovery.

Findings

01 Strong preservation of key concepts in summaries

02 Identification of emerging research methodologies

03 Insights into interdisciplinary research trends

Abstract

The rapid expansion of scientific literature in computer science presents challenges in tracking research trends and extracting key insights. Existing datasets provide metadata but lack structured summaries that capture core contributions and methodologies. We introduce CS-PaperSum, a large-scale dataset of 91,919 papers from 31 top-tier computer science conferences, enriched with AI-generated structured summaries using ChatGPT. To assess summary quality, we conduct embedding alignment analysis and keyword overlap analysis, demonstrating strong preservation of key concepts. We further present a case study on AI research trends, highlighting shifts in methodologies and interdisciplinary crossovers, including the rise of self-supervised learning, retrieval-augmented generation, and multimodal AI. Our dataset enables automated literature analysis, research trend forecasting, and AI-driven…
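The abstract names two quality checks, keyword overlap and embedding alignment, without giving their formulas here. The following is a minimal sketch of what such checks could look like, not the authors' procedure. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` list, the helper names `keywords`, `keyword_overlap`, and `cosine_similarity`, the sample strings, and the use of bag-of-words count vectors as a stand-in for learned embeddings.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "for", "we", "with", "on", "is", "are"}


def keywords(text):
    """Return lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def keyword_overlap(source, summary):
    """Fraction of the summary's distinct keywords that also occur in the source."""
    src, summ = set(keywords(source)), set(keywords(summary))
    return len(src & summ) / len(summ) if summ else 0.0


def cosine_similarity(source, summary):
    """Cosine similarity between bag-of-words count vectors.

    Assumption: count vectors stand in for the learned embeddings
    that an embedding alignment analysis would normally use.
    """
    a, b = Counter(keywords(source)), Counter(keywords(summary))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up example texts, not taken from the dataset.
abstract = "We introduce a dataset of papers with structured summaries generated by a language model."
summary = "A dataset of papers with structured summaries."

print(keyword_overlap(abstract, summary))
print(cosine_similarity(abstract, summary))
```

In this sketch a high overlap score means the summary introduces few terms that are absent from the source, while the cosine score also penalizes source content the summary leaves out. Swapping the count vectors for sentence embeddings would leave the cosine computation unchanged.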

Taxonomy

Topics: Machine Learning in Materials Science · Biomedical Text Mining and Ontologies · Topic Modeling