On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning

Geewook Kim; Minjoon Seo

arXiv:2406.11823 · cs.CV · October 8, 2024

TL;DR

This paper investigates how to design efficient vision-language models for visually-situated language understanding, focusing on balancing model complexity, computational cost, and performance, and provides practical strategies for optimization.

Contribution

The paper identifies the key components of efficient vision-language models and proposes methods to optimize them, achieving high performance under constrained inference costs.

Findings

01. Significant improvements in inference throughput without sacrificing accuracy

02. Insights into the impact of model size and vision modules on performance

03. Open-source release of models, code, and datasets for reproducibility

Abstract

Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks effectively, they face challenges with the high computational demands of complex visually-situated text understanding. Such tasks often require increased token inputs and large vision modules to harness high-resolution information. Striking a balance between model size and data importance remains an open question. This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs. By strategically formulating datasets, optimizing vision modules, and enhancing supervision techniques, we achieve significant improvements in inference throughput while maintaining high…

Code & Models

Repositories

naver-ai/elva (PyTorch, official)

Videos

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning (Underline)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Speech and Dialogue Systems