TL;DR
LLaVA-SP adds six spatial visual tokens to the original visual tokens to strengthen the modeling of local visual relationships in multimodal large language models, improving performance on diverse visual understanding tasks.
Contribution
The paper proposes a novel approach to incorporate visual spatial tokens into MLLMs, improving local feature modeling and overall visual representation quality.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Outperforms LLaVA-1.5 with similar inference latency.
Enables diverse visual understanding through two model variants.
Abstract
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual…
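To make the "from central region to global" ordering concrete, here is a minimal NumPy sketch of how six extra tokens could be derived from a grid of ViT patch features and appended to the original visual tokens. Several details are assumptions, not taken from the paper: the 24x24 grid with 1024 channels (typical of CLIP-ViT-L/14 at 336px), the linearly growing crop sizes, and mean pooling as a stand-in for the learned convolutional kernels. The function name `central_to_global_tokens` is hypothetical, and the cross-attention fusion step is omitted.

```python
import numpy as np

def central_to_global_tokens(patch_grid, num_tokens=6):
    """Pool nested, centered crops of an (H, W, C) patch grid into num_tokens vectors.

    Mean pooling stands in for learned convolutional kernels (an assumption).
    """
    h, w, c = patch_grid.shape
    tokens = []
    for i in range(1, num_tokens + 1):
        # Crop side grows linearly from a small central window to the full grid.
        ch = max(1, round(h * i / num_tokens))
        cw = max(1, round(w * i / num_tokens))
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = patch_grid[top:top + ch, left:left + cw]
        tokens.append(crop.reshape(-1, c).mean(axis=0))
    return np.stack(tokens)  # shape: (num_tokens, C)

rng = np.random.default_rng(0)
# Assumed shape: 24x24 patches with 1024-dim features (random stand-in data).
grid = rng.standard_normal((24, 24, 1024))
spatial = central_to_global_tokens(grid)

# Append the six spatial tokens to the 576 original patch tokens.
visual = np.concatenate([grid.reshape(-1, 1024), spatial], axis=0)
print(visual.shape)  # (582, 1024)
```

Under these assumptions the sequence grows from 576 to 582 tokens, about a 1% increase, which is consistent with the claim of inference latency similar to LLaVA-1.5.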
