FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Jiacheng Ruan; Yebin Yang; Zehao Lin; Yuchen Feng; Feiyu Xiong; Zeyun Tang; Zhiyu Li

arXiv:2410.12564·cs.CV·November 26, 2024

TL;DR

This paper introduces FTII-Bench, a challenging multimodal benchmark for evaluating large vision-language models' ability to understand flow text and insert appropriate images, highlighting current models' limitations.

Contribution

The paper proposes a novel FTII task and benchmark, using real news articles to assess LVLMs' capabilities in complex image-text sequencing and comprehension.

Findings

01. Existing models struggle with the FTII task.
02. FTII-Bench includes 625 high-quality news articles across 10 domains.
03. Even advanced models like GPT-4o face significant challenges.

Abstract

Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluate only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs' potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion (FTII) task. This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the…
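The abstract is cut off before it finishes describing the task, so the following is only a minimal sketch of the general shape it suggests: text accumulates paragraph by paragraph, and at each step one image is chosen from a candidate set. Everything here is an assumption for illustration, not the benchmark's actual protocol: the names `select_images`, `accuracy`, and `toy_score` are hypothetical, images are stood in for by short text tags, and the scorer is a trivial string heuristic standing in for an LVLM.

```python
from typing import Callable, List, Sequence


def select_images(
    paragraphs: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> List[str]:
    """Pick one candidate per step as the text accumulates.

    Hypothetical harness: `score(text, image)` stands in for whatever
    model judges how well an image fits the text read so far.
    """
    choices: List[str] = []
    context: List[str] = []
    for paragraph in paragraphs:
        context.append(paragraph)
        text_so_far = "\n".join(context)
        # Highest score wins; on ties, max() keeps the earliest candidate.
        best = max(candidates, key=lambda image: score(text_so_far, image))
        choices.append(best)
    return choices


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of steps where the chosen image matches the reference."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def toy_score(text: str, image_tag: str) -> float:
    """Toy stand-in scorer (assumption): prefer the most recently mentioned tag."""
    return float(text.rfind(image_tag))
```

With two paragraphs such as "The storm hit the coast." and "Rescue boats arrived by evening." and the candidate tags `storm`, `boats`, and `stadium`, this toy scorer picks `storm` after the first paragraph and `boats` after the second. A real evaluation would replace `toy_score` with a model call and the tags with actual images.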

Code & Models

Repositories

IAAR-Shanghai/FTIIBench (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques

Methods: Focus · Contrastive Language-Image Pre-training · Sparse Evolutionary Training