Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding

Zhanpeng Chen; Mingxiao Li; Ziyang Chen; Nan Du; Xiaolong Li; Yuexian Zou

arXiv:2501.10967·cs.CV·February 13, 2025

TL;DR

This paper introduces PyPE, a novel visual position encoding method that enhances vision-language models' perception by addressing positional encoding limitations, leading to improved general capabilities across different model sizes.

Contribution

PyPE is a multi-granularity visual position encoding approach. It improves perception in VLMs by addressing the limitations of raster-scan position indexing and the long-term decay induced by RoPE, which promotes a more rational allocation of attention weights.

Findings

01 PyPE improves VLM performance across different model sizes.

02 PyPE enhances the models' perception of visual tokens.

03 PyPE mitigates the long-term decay effects induced by RoPE.

Abstract

Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a…
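The periphery-to-center indexing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ring_position_ids`, the `start` offset, and the choice to give every token in the same concentric ring one shared index are assumptions made for illustration only, and the incremental expansion of the central receptive field is not modeled here.

```python
def ring_position_ids(height, width, start=0):
    """Assign one position index per concentric ring of a patch grid.

    Hypothetical helper, not taken from the PyPE codebase. The outermost
    ring of patches gets `start`, the next ring inward gets `start + 1`,
    and so on toward the center.
    """
    ids = []
    for r in range(height):
        row = []
        for c in range(width):
            # Distance to the nearest grid border identifies the ring.
            ring = min(r, c, height - 1 - r, width - 1 - c)
            row.append(start + ring)
        ids.append(row)
    return ids


if __name__ == "__main__":
    # Print the index map for a 4x4 grid of visual tokens.
    for row in ring_position_ids(4, 4):
        print(row)
```

Under these assumptions, a grid with side length n uses only about n/2 distinct indexes rather than the n² a raster scan would use, which is one way the relative distance between visual tokens and the instruction tokens that follow them could shrink.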

Code & Models

Repositories

sakuratroychen/pype
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need