FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji

TL;DR
FlashSloth is a fast, efficient multimodal large language model that uses embedded visual compression to reduce visual tokens and computational load while maintaining high performance across visual-language tasks.
Contribution
The paper introduces FlashSloth, a novel tiny MLLM that enhances visual token compression to improve speed and efficiency without sacrificing performance.
Findings
Significantly reduces the number of visual tokens, training memory, and computational complexity.
Maintains high performance on various visual-language tasks.
Outperforms existing tiny MLLMs like InternVL2, MiniCPM-V2, and Qwen2-VL.
Abstract
Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens they still use limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. Different from previous efforts, FlashSloth focuses on improving the descriptive power of visual tokens in the process of compressing their redundant semantics. In particular, FlashSloth introduces embedded visual compression designs to capture both visually salient and instruction-related image information, so as to achieve superior multimodal performance with fewer visual tokens. Extensive experiments are conducted to validate the proposed FlashSloth, and a bunch of tiny but strong MLLMs are also…
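To make the idea of compressing visual tokens concrete, here is a minimal sketch of attention-weighted pooling over local windows of a token grid. It is an illustration only, not the FlashSloth architecture: the function name `attention_pool`, the use of a single query vector (imagined here as an instruction embedding), and the grid and window sizes are all assumptions introduced for this example.

```python
import numpy as np

def attention_pool(tokens, grid, window, query):
    """Compress a square grid of visual tokens, one output token per window.

    tokens: (grid*grid, d) array of token features in row-major grid order.
    grid:   side length of the token grid.
    window: side length of each pooling window (must divide grid).
    query:  (d,) scoring vector (hypothetical, e.g. an instruction embedding).
    Returns an array of shape ((grid // window)**2, d).
    """
    n, d = tokens.shape
    assert n == grid * grid and grid % window == 0
    g = grid // window
    # Group tokens by window: (g*g windows, window*window tokens each, d).
    x = tokens.reshape(g, window, g, window, d).transpose(0, 2, 1, 3, 4)
    x = x.reshape(g * g, window * window, d)
    # Scaled dot-product score of every token against the query.
    scores = x @ query / np.sqrt(d)
    # Numerically stable softmax within each window.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Weighted sum of the tokens in each window.
    return np.einsum("gw,gwd->gd", weights, x)

# Usage: a 24x24 grid (576 tokens) pooled with 4x4 windows gives 36 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))
query = rng.normal(size=64)
compressed = attention_pool(tokens, grid=24, window=4, query=query)
```

With an all-zero query the weights become uniform and the function reduces to plain average pooling; a non-zero query shifts weight toward tokens that align with it, which is one simple way to bias compression toward instruction-relevant regions.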
Taxonomy
Topics: Video Analysis and Summarization · Computational Physics and Python Applications · Topic Modeling
