QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining

Kyle R. Chickering; Bangzheng Li; Muhao Chen

arXiv:2505.23004·cs.LG·March 27, 2026

TL;DR

QLIP is a content-aware, quadtree-based drop-in replacement for the CLIP vision encoder in MLLMs that improves visual understanding and question-answering accuracy without retraining.

Contribution

Introduces QLIP, a drop-in, content-aware quadtree vision prior that overcomes CLIP limitations and improves MLLM performance without retraining.

Findings

01

Improves LLaVA v1.5 visual question answering accuracy

02

Improves detailed visual understanding on the V-star benchmark by up to 13.6%

03

Addresses CLIP limitations with minimal integration effort

Abstract

Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate crossmodal representation learning. The CLIP model is a widely adopted foundational vision language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations including being constrained to only handling fixed input resolutions and a failure to produce separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we…
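The abstract is cut off before it describes the method, so the following is only a generic illustration of what a content-aware quadtree over an image looks like, not the authors' algorithm. It assumes a grayscale image stored as a list of rows and uses pixel variance as the split criterion. The function name `quadtree_leaves`, the `min_size` and `threshold` parameters, and the variance rule are all hypothetical choices made for this sketch.

```python
from statistics import pvariance


def quadtree_leaves(img, x0, y0, size, min_size, threshold):
    """Return (x, y, size) leaves covering a square region of a grayscale image.

    Hypothetical split rule (an assumption, not taken from the paper):
    a region is divided into four quadrants while its pixel variance
    exceeds `threshold` and it is still larger than `min_size`.
    """
    pixels = [img[y][x] for y in range(y0, y0 + size) for x in range(x0, x0 + size)]
    if size <= min_size or pvariance(pixels) <= threshold:
        return [(x0, y0, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x0 + dx, y0 + dy, half, min_size, threshold)
    return leaves


# Toy 8x8 image: flat background with two bright pixels in the top-left corner.
image = [[0] * 8 for _ in range(8)]
image[0][0], image[1][1] = 255, 255

leaves = quadtree_leaves(image, 0, 0, 8, min_size=2, threshold=10.0)
for leaf in leaves:
    print(leaf)
```

In this toy case the flat quadrants stay as single large leaves while the detailed corner is subdivided, so the number of regions (and hence of tokens, if each leaf were encoded as one token) follows image content instead of a fixed grid.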

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Excellent Problem Diagnosis: The paper provides a clear and valuable conceptual framework by identifying and naming "mesoscopic bias" and "interpolation bias". This diagnosis of why CLIP fails at high resolutions is a useful contribution to the field.

Strong Target-Task Performance: The +13.6% gain on the $V^{*}$ benchmark is extremely significant. This is a challenging benchmark designed to test the exact fine-grained failures of MLLMs, so this result strongly suggests the method is effective

Weaknesses

1. The paper's components are not fundamentally new. Quadtrees are a classic data structure, and their application to vision transformers and dynamic tokenization has been explored. Similarly, the token pruning/merging field is already well-established. The idea of replacing static positional embeddings with a dynamic, coordinate-based MLP is also a known technique (e.g., in NeRF). The novelty lies in the combination for this specific MLLM problem, but the components themselves are iterative. 2

Reviewer 02 · Rating 4 · Confidence 3

Strengths

Simple, training-free at inference time, and demonstrably effective.

Weaknesses

**Rapidly moving baseline.** The community is already shifting from CLIP to newer vision backbones (e.g. InternVL, SigLIP-2) that use RoPE or 2-D absolute + relative encoders and are pre-trained with native multi-resolution recipes. It is unclear whether QLIP retains any advantage when the underlying encoder itself is resolution-robust. A head-to-head comparison with such models is missing. **Limited to CLS-level bias.** QLIP is optimised to deliver a single, high-quality CLS token; it does

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Practical & Cost-Effective Drop-in Solution: QLIP’s design as a "drop-in replacement" is highly impactful. By significantly enhancing the visual signal without necessitating the re-training or fine-tuning of the entire MLLM pipeline, it offers a practical, low-cost path to upgrading existing MLLMs. 2. Clear Theoretical Motivation: The paper clearly and quantitatively identifies two specific, fundamental biases in the CLIP vision encoder (mesoscopic and interpolation bias). The proposed solut

Weaknesses

1. The base model is out of date. LLaVA suffers from limited performance. I suggest the authors conduct evaluations on more SOTA models, such as InternVL-3.5 or QwenVL2.5, to truly demonstrate the effectiveness. 2. I wonder about the performance on visual grounding benchmarks, such as RefCOCO, since they also require fine-grained region information.

Code & Models

Repositories

kyrochi/qlip · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Contrastive Language-Image Pre-training · Attentive Walk-Aggregating Graph Neural Network