PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Rohit Saxena, Pasquale Minervini, Frank Keller

TL;DR
PosterSum introduces a new multimodal benchmark dataset of scientific posters with abstracts, highlighting challenges for vision-language models and proposing a hierarchical method that improves summarization performance.
Contribution
The paper presents PosterSum, a multimodal dataset of 16,305 conference posters paired with their abstracts, and proposes Segment & Summarize, a hierarchical approach that outperforms current MLLMs on automated metrics.
Findings
State-of-the-art MLLMs struggle with poster understanding
The hierarchical Segment & Summarize method improves ROUGE-L by 3.14% over current MLLMs
PosterSum serves as a new benchmark for future research
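
For readers unfamiliar with the metric behind the reported gain, ROUGE-L scores a generated summary by the longest common subsequence (LCS) of tokens it shares with the reference abstract. The following is a minimal sketch, not the paper's evaluation code: it assumes lowercased whitespace tokenization, no stemming, and an F1-style combination of precision and recall. The function names `lcs_length` and `rouge_l` are illustrative, and the paper's exact ROUGE configuration is not stated on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Rolling-row dynamic programme: prev[j] holds the LCS length of the
    # tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 between a candidate summary and a reference summary.

    Assumption: lowercased whitespace tokens, no stemming or stopword
    handling, equal weighting of precision and recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "we introduce a benchmark for poster summarization"
    abstract = "we introduce postersum a benchmark for scientific poster summarization"
    print(f"ROUGE-L F1: {rouge_l(generated, abstract):.3f}")
```

Because the LCS preserves word order without requiring contiguity, the score rewards summaries that follow the reference's sequence of content even when words are inserted or dropped in between. Published ROUGE implementations typically add stemming and their own tokenization, so scores from this sketch will not match reported numbers exactly.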
Abstract
Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like scientific posters. We introduce PosterSum, a novel benchmark to advance the development of vision-language models that can understand and summarize scientific posters into research paper abstracts. Our dataset contains 16,305 conference posters paired with their corresponding abstracts as summaries. Each poster is provided in image format and presents diverse visual understanding challenges, such as complex layouts, dense text regions, tables, and figures. We benchmark state-of-the-art Multimodal Large Language Models (MLLMs) on PosterSum and demonstrate that they struggle to accurately interpret and summarize scientific posters. We propose Segment & Summarize, a hierarchical method that outperforms current MLLMs on automated metrics,…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Text Analysis Techniques
