FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion
Jiacheng Ruan, Yebin Yang, Zehao Lin, Yuchen Feng, Feiyu Xiong, Zeyun Tang, Zhiyu Li

TL;DR
This paper introduces FTII-Bench, a challenging multimodal benchmark for evaluating large vision-language models' ability to understand flow text and insert appropriate images, highlighting current models' limitations.
Contribution
The paper proposes a novel FTII task and benchmark, using real news articles to assess LVLMs' capabilities in complex image-text sequencing and comprehension.
Findings
Existing models struggle with the FTII task.
FTII-Bench includes 625 high-quality news articles across 10 domains.
Even advanced models like GPT-4o face significant challenges.
Abstract
Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluate only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs' potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion task (FTII). This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the…
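The task setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: it assumes one selection decision per accumulated text prefix, represents each candidate image by a caption string, and uses a toy keyword-overlap chooser as a stand-in for an LVLM. The names `run_ftii_episode`, `keyword_overlap_chooser`, and `accuracy` are hypothetical.

```python
from typing import Callable, List, Sequence


def run_ftii_episode(
    paragraphs: Sequence[str],
    candidates: Sequence[str],
    choose: Callable[[str, Sequence[str]], int],
) -> List[int]:
    """Ask `choose` for a candidate index after each accumulated prefix.

    Assumption: one decision is made every time a paragraph is appended.
    """
    choices = []
    for k in range(1, len(paragraphs) + 1):
        # The context grows as paragraphs accumulate.
        context = "\n".join(paragraphs[:k])
        choices.append(choose(context, candidates))
    return choices


def keyword_overlap_chooser(context: str, candidates: Sequence[str]) -> int:
    """Toy stand-in for an LVLM: pick the caption sharing the most words
    with the context. Ties resolve to the lowest index."""
    context_words = set(context.lower().split())
    scores = [len(context_words & set(c.lower().split())) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of steps where the chosen index matches the gold index."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

A real evaluation would replace `keyword_overlap_chooser` with a call to the model under test and pass actual images rather than captions; the loop structure is the only part taken from the abstract.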
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
Methods: Focus · Contrastive Language-Image Pre-training · Sparse Evolutionary Training
