Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian, Zou

TL;DR
This paper introduces PyPE, a novel visual position encoding method that enhances vision-language models' perception by addressing positional encoding limitations, leading to improved general capabilities across different model sizes.
Contribution
PyPE provides a new multi-granularity visual position encoding approach that improves perception in VLMs by mitigating traditional encoding issues and enhancing attention allocation.
Findings
PyPE improves VLM performance across various sizes.
Enhanced perception of visual tokens with PyPE.
Reduction of positional decay effects in encoding.
Abstract
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Speech and dialogue systems
MethodsSoftmax · Attention Is All You Need
