Exploring the Capabilities of Large Multimodal Models on Dense Text
Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai

TL;DR
This paper evaluates large multimodal models on dense textual tasks using a new dataset, revealing their strengths and weaknesses, and demonstrates that prompt engineering and fine-tuning significantly improve performance.
Contribution
The paper introduces the DT-VQA dataset for dense text tasks and provides a comprehensive evaluation of LMMs, highlighting strategies to enhance their capabilities.
Findings
Prompt engineering and downstream fine-tuning both yield significant performance improvements.
GPT4V and Gemini outperform open-source models on dense text tasks.
Automatically labeled datasets can effectively boost model performance.
Abstract
While large multi-modal models (LMM) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMM in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMM: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in…
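As a rough illustration of how model answers might be scored against question-answer pairs such as those in DT-VQA, the sketch below counts a prediction as correct when the normalized reference answer appears inside the normalized model output. This summary does not state the paper's actual metric, so the containment rule, the normalization, and the function names (`normalize`, `contains_answer`, `accuracy`) are all assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contains_answer(prediction: str, answer: str) -> bool:
    """Hypothetical scoring rule: the normalized reference answer must
    appear as a substring of the normalized prediction."""
    norm_answer = normalize(answer)
    return bool(norm_answer) and norm_answer in normalize(prediction)


def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that contain their reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(contains_answer(p, a) for p, a in zip(predictions, answers))
    return hits / len(answers)


if __name__ == "__main__":
    # Made-up examples, not taken from the dataset.
    preds = ["The total is $12.50.", "unknown"]
    refs = ["12.50", "Net weight 500g"]
    print(accuracy(preds, refs))
```

Substring matching is lenient: a short reference such as "1" would also match a prediction containing "12", so a stricter token-level or edit-distance comparison may be preferable for short answers.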
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining
