LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Haoran Lou; Chunxiao Fan; Ziyan Liu; Yuexin Wu; Xinliang Wang

arXiv:2507.00505 · cs.CV · July 8, 2025

TL;DR

LLaVA-SP adds a small set of visual spatial tokens to the original visual tokens to strengthen the modeling of local relationships between image patches in multimodal large language models, improving performance on diverse visual understanding tasks.

Contribution

The paper proposes a novel approach to incorporate visual spatial tokens into MLLMs, improving local feature modeling and overall visual representation quality.

Findings

1. Achieves state-of-the-art performance on multiple benchmarks.

2. Outperforms LLaVA-1.5 with similar inference latency.

3. Enables diverse visual understanding through two model variants.

Abstract

The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT captures global image features well, it struggles to model local relationships between adjacent patches. This weakens the visual representation and, in turn, the detailed understanding ability of MLLMs. To address this, we propose LLaVA-SP, which adds only six visual spatial tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual…
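The "from central region to global" ordering mentioned in the abstract can be pictured with a small, dependency-free sketch. It is an illustration under stated assumptions, not the authors' Projector: the 24 x 24 patch grid (what CLIP ViT-L/14 produces at 336 px input), the crop schedule `[4, 8, 12, 16, 20, 24]`, the uniform kernel weights standing in for learned convolution weights, and all function names are hypothetical. The cross-attention fusion step is omitted.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] holds one patch feature vector


def central_crop(grid: Grid, size: int) -> Grid:
    """Return the size x size block centred in a square patch grid."""
    side = len(grid)
    start = (side - size) // 2
    return [row[start:start + size] for row in grid[start:start + size]]


def full_kernel_token(crop: Grid) -> Vector:
    """Collapse a crop to one token with a kernel that spans the whole crop.

    Uniform weights are used here as a stand-in for learned convolution
    weights (an assumption made only for illustration).
    """
    dim = len(crop[0][0])
    count = len(crop) * len(crop[0])
    token = [0.0] * dim
    for row in crop:
        for vec in row:
            for i, value in enumerate(vec):
                token[i] += value / count
    return token


def spatial_tokens(grid: Grid, crop_sizes: List[int]) -> List[Vector]:
    """Produce one token per nested central crop, ordered centre to global."""
    return [full_kernel_token(central_crop(grid, s)) for s in crop_sizes]


def demo() -> dict:
    side = 24  # assumed patch grid side (CLIP ViT-L/14 at 336 px)
    # Toy 2-dimensional features: each patch stores its own (row, col) index.
    grid = [[[float(r), float(c)] for c in range(side)] for r in range(side)]
    crop_sizes = [4, 8, 12, 16, 20, 24]  # hypothetical schedule, six crops
    extra = spatial_tokens(grid, crop_sizes)
    patch_tokens = [vec for row in grid for vec in row]
    # The spatial tokens are added alongside the original patch tokens.
    return {"extra": extra, "total": len(patch_tokens) + len(extra)}
```

Under these assumptions, calling `demo()` yields six extra tokens next to the 576 patch tokens, so the language model would see 582 visual tokens in total. That small increase is in keeping with the finding that latency stays close to LLaVA-1.5.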

Code & Models

Repositories

cnfaker/llava-sp (PyTorch, official)

Models

Levideus/llava-sp-cropping-lora (Hugging Face model)