On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
Geewook Kim, Minjoon Seo

TL;DR
This paper investigates how to design efficient vision-language models for visually-situated language understanding, focusing on balancing model complexity, computational cost, and performance, and provides practical strategies for optimization.
Contribution
It identifies key components for efficient vision-language models and proposes methods to optimize them, achieving high performance with constrained inference costs.
Findings
Significant improvements in inference throughput without sacrificing accuracy
Insights into the impact of model size and vision modules on performance
Open-source release of models, code, and datasets for reproducibility
Abstract
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks effectively, they face challenges with the high computational demands of complex visually-situated text understanding. Such tasks often require increased token inputs and large vision modules to harness high-resolution information. Striking a balance between model size and data importance remains an open question. This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs. By strategically formulating datasets, optimizing vision modules, and enhancing supervision techniques, we achieve significant improvements in inference throughput while maintaining high…
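The abstract's point that high-resolution inputs require more tokens can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a plain ViT-style encoder that splits the image into non-overlapping square patches and emits one token per patch, with a hypothetical patch size of 14. It is not a description of the vision module used in the paper.

```python
import math


def vision_token_count(height: int, width: int, patch_size: int = 14) -> int:
    """Patch tokens for a plain ViT-style encoder.

    Assumptions (not taken from the paper): non-overlapping square patches,
    one token per patch, no class token, no pooling or token compression.
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows * cols


if __name__ == "__main__":
    # Doubling the side length quadruples the number of tokens.
    for side in (224, 336, 672, 1344):
        print(f"{side}x{side}: {vision_token_count(side, side)} tokens")
```

Under these assumptions the token count grows quadratically with image side length, which is why reading dense text at high resolution tends to dominate inference cost.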
Code & Models
- 🤗 gwkrsrch/Elva-Llama-160M (model)
- 🤗 gwkrsrch/Elva-Tiny-Vicuna-1.1B (model)
- 🤗 gwkrsrch/Elva-Phi3-3.8B (model) · 6 dl · ♡ 1
- 🤗 gwkrsrch/Elva-Vicuna-7B (model) · 3 dl
- 🤗 gwkrsrch/Elva-Vicuna-13B (model)
- 🤗 gwkrsrch/Elva-OpenELM-1.1B (model) · 1 dl
- 🤗 gwkrsrch/Elva-OpenELM-270M (model) · 1 dl
- 🤗 gwkrsrch/Elva-OpenELM-450M (model) · 1 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Speech and dialogue systems
