TL;DR
This paper introduces the Multimodal Conference Dataset (MCD), a benchmark for aligning and understanding scientific content across text, visuals, and speech, evaluating current models' capabilities and limitations.
Contribution
The paper presents MCD, the first benchmark to integrate research papers, presentation videos, explanatory videos, and slides from the same works, and systematically evaluates how well models discover correspondences across these formats.
Findings
Vision-language models are robust but struggle with fine-grained alignment.
Embedding-based models capture text-visual correspondences well.
Equations and symbolic content form distinct clusters in embeddings.
Abstract
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language…
