What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Dongqi Liu; Chenxi Whitehouse; Xi Yu; Louis Mahon; Rohit Saxena; Zheng Zhao; Yifu Qiu; Mirella Lapata; Vera Demberg

arXiv:2502.08279 · cs.CL · May 27, 2025

TL;DR

This paper introduces VISTA, a new dataset for scientific video-to-text summarization, and demonstrates that explicit planning improves summary quality, though a significant gap with human performance remains.

Contribution

The paper presents VISTA, a dataset of 18,599 recorded AI conference presentations paired with their paper abstracts, and evaluates whether plan-based models improve summary quality.

Findings

01 Explicit planning enhances summary quality and factual consistency.

02 State-of-the-art models still lag behind human performance.

03 The VISTA dataset highlights the challenges of scientific video-to-text summarization.

Abstract

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
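
The plan-based idea in the abstract can be sketched as a two-stage pipeline: first draft a plan, then write the summary conditioned on that plan. The abstract does not say what form the plan takes or how the models are called, so everything named here is an assumption for illustration: the `TalkExample` record, its field names, the plan as a list of section labels, and the `toy_plan` and `toy_writer` stubs that stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TalkExample:
    """Hypothetical record: one recorded talk paired with its paper abstract."""
    video_path: str  # assumed field name, not taken from the dataset release
    abstract: str    # the paper abstract, used as the reference summary


def plan_then_summarize(
    example: TalkExample,
    make_plan: Callable[[str], List[str]],
    write_summary: Callable[[str, List[str]], str],
) -> str:
    """Two-stage generation: build a plan first, then a summary conditioned on it."""
    plan = make_plan(example.video_path)
    return write_summary(example.video_path, plan)


def toy_plan(video_path: str) -> List[str]:
    """Stub planner: a real system would derive the plan from the video."""
    return ["problem", "method", "results"]


def toy_writer(video_path: str, plan: List[str]) -> str:
    """Stub writer: emits one placeholder segment per plan step, in order."""
    return " ".join(f"[{step}]" for step in plan)


example = TalkExample(video_path="talk.mp4", abstract="(reference abstract)")
summary = plan_then_summarize(example, toy_plan, toy_writer)
```

Keeping the planner and the writer as separate callables makes it possible to swap either stage, or to compare against a baseline that skips the plan, without touching the rest of the pipeline.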

Code & Models

Repositories

dongqi-me/vista · PyTorch · Official

Videos

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Biomedical Text Mining and Ontologies