Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Frederik Broy, Maike Züfle, Jan Niehues

TL;DR
This paper introduces Talk2Ref, a large-scale dataset for reference prediction from scientific talks, and demonstrates that fine-tuning models on this data improves citation prediction accuracy in spoken scientific content.
Contribution
The paper presents the first large-scale dataset for reference prediction from scientific talks and establishes baseline models, advancing research in linking spoken scientific presentations to relevant literature.
Findings
Fine-tuning on Talk2Ref improves citation prediction performance.
State-of-the-art models face challenges with long transcripts.
The dataset enables better semantic understanding of spoken scientific content.
Abstract
Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation.…
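To make the retrieval setup concrete, here is a minimal sketch of embedding-based reference prediction. It is illustrative only and is not the authors' implementation: the `embed` function is a toy hashed bag-of-words stand-in for a real text embedding model, and splitting the transcript into overlapping windows and mean-pooling them is just one plausible way to handle long transcripts. All function names, window sizes, and the recall metric shown here are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def embed(text, dim=256):
    """Toy stand-in for an embedding model: hashed bag-of-words, unit length."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += count
    return normalize(vec)


def chunk_words(text, size=50, stride=25):
    """Split text into overlapping word windows that cover the whole text."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    starts = range(0, len(words) - size + stride, stride)
    return [" ".join(words[i:i + size]) for i in starts]


def embed_talk(transcript, size=50, stride=25):
    """Embed each window, then mean-pool (one assumed long-transcript strategy)."""
    chunks = [embed(c) for c in chunk_words(transcript, size, stride)]
    mean = [sum(col) / len(chunks) for col in zip(*chunks)]
    return normalize(mean)


def rank_papers(transcript, papers, k=5):
    """Return the ids of the k papers most similar to the talk (cosine similarity)."""
    query = embed_talk(transcript)
    scores = {
        pid: sum(q * p for q, p in zip(query, embed(text)))
        for pid, text in papers.items()
    }
    # sorted() is stable, so ties keep the insertion order of `papers`.
    return sorted(scores, key=lambda pid: -scores[pid])[:k]


def recall_at_k(ranked, cited):
    """Fraction of the cited papers that appear in the ranked list."""
    cited = set(cited)
    return len(set(ranked) & cited) / len(cited) if cited else 0.0
```

In use, `papers` would map paper ids to titles or abstracts, `rank_papers(transcript, papers, k)` would give the predicted references for one talk, and `recall_at_k` would compare them with the papers cited in the talk's source publication. A trained dual encoder would replace `embed` with separate learned encoders for talks and papers.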
