Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Frederik Broy; Maike Züfle; Jan Niehues

arXiv:2510.24478·cs.CL·October 29, 2025

TL;DR

This paper introduces Talk2Ref, a large-scale dataset for reference prediction from scientific talks, and demonstrates that fine-tuning models on this data improves citation prediction accuracy in spoken scientific content.

Contribution

The paper presents the first large-scale dataset for reference prediction from scientific talks and establishes baseline models, advancing research in linking spoken scientific presentations to relevant literature.

Findings

01 Fine-tuning on Talk2Ref improves citation prediction performance.

02 State-of-the-art models face challenges with long transcripts.

03 The dataset enables better semantic understanding of spoken scientific content.

Abstract

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation.…
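
To make the retrieval framing in the abstract concrete, here is a minimal sketch of embedding-based reference ranking for a long transcript. It is not the paper's method. The bag-of-words `embed` function is a toy stand-in for a real text embedding model, and the chunk-then-mean-pool step is only one plausible way to handle long input. All function names and parameters (`chunk_words`, `embed_long`, `rank_papers`, `chunk_size`, `overlap`) are hypothetical.

```python
import math
from collections import Counter


def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping windows of words."""
    assert 0 <= overlap < chunk_size
    words = text.lower().split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words), 1), step)]


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def embed(text, vocab):
    """Toy embedding: L2-normalized word counts over a fixed vocabulary.
    Assumption: a real system would call a trained text encoder here."""
    counts = Counter(text.lower().split())
    return l2_normalize([float(counts[w]) for w in vocab])


def embed_long(text, vocab, chunk_size=50, overlap=10):
    """Embed each chunk separately, then mean-pool and re-normalize."""
    chunk_vecs = [embed(c, vocab) for c in chunk_words(text, chunk_size, overlap)]
    mean = [sum(col) / len(chunk_vecs) for col in zip(*chunk_vecs)]
    return l2_normalize(mean)


def rank_papers(transcript, papers, top_k=5):
    """Return (paper_index, cosine_score) pairs, best match first."""
    vocab = sorted({w for t in [transcript, *papers] for w in t.lower().split()})
    query = embed_long(transcript, vocab)
    scores = [sum(a * b for a, b in zip(query, embed(p, vocab))) for p in papers]
    order = sorted(range(len(papers)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


if __name__ == "__main__":
    papers = [
        "speech translation for low resource languages",
        "protein folding with graph networks",
    ]
    transcript = "in this talk we present speech translation models for low resource languages " * 20
    for idx, score in rank_papers(transcript, papers):
        print(idx, round(score, 3), papers[idx])
```

In a dual-encoder setup, the talk side (`embed_long`) and the paper side (`embed`) would be separate trained encoders sharing one vector space, and the cosine ranking step would stay the same.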

Code & Models

Models

Datasets

s8frbroy/talk2ref
dataset · 16 dl