QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining
Kyle R. Chickering, Bangzheng Li, Muhao Chen

TL;DR
QLIP is a novel, content-aware quadtree-based visual encoder that seamlessly replaces CLIP in MLLMs, significantly enhancing visual understanding and question answering accuracy without retraining.
Contribution
Introduces QLIP, a drop-in, content-aware quadtree vision prior that overcomes CLIP limitations and improves MLLM performance without retraining.
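To make "content-aware quadtree" concrete, here is a minimal sketch of quadtree patchification in plain Python. It is not the authors' implementation: the pixel-variance splitting rule, the `var_threshold` and `min_size` parameters, and the function name `quadtree_patches` are all assumptions chosen for illustration. The idea it shows is only that flat regions stay as large patches while detailed regions are subdivided.

```python
from statistics import pvariance


def quadtree_patches(image, min_size=2, var_threshold=100.0):
    """Split a square grayscale image into content-adaptive leaf patches.

    image: list of rows (lists of numbers); side length is a power of two.
    Returns a list of (row, col, size) leaves.
    NOTE: pixel variance is an assumed splitting criterion, used here
    for illustration only.
    """
    leaves = []

    def visit(r, c, size):
        pixels = [image[i][j]
                  for i in range(r, r + size)
                  for j in range(c, c + size)]
        # Keep the region as one leaf if it is small enough or nearly uniform.
        if size <= min_size or pvariance(pixels) <= var_threshold:
            leaves.append((r, c, size))
            return
        # Otherwise recurse into the four quadrants.
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                visit(r + dr, c + dc, half)

    visit(0, 0, len(image))
    return leaves


# Toy 8x8 image: flat background with one bright 2x2 detail in the top-left.
img = [[0] * 8 for _ in range(8)]
for i in range(2):
    for j in range(2):
        img[i][j] = 255

leaves = quadtree_patches(img)
print(len(leaves), "patches instead of", (8 // 2) ** 2, "uniform 2x2 patches")
```

On this toy input only the quadrant containing the bright detail is subdivided, so the image is covered by fewer patches than a uniform grid at the finest resolution would need.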
Findings
Improves LLaVA v1.5 visual question answering accuracy
Enhances detailed understanding on V-star benchmark by up to 13.6%
Addresses CLIP limitations with minimal integration effort
Abstract
Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate crossmodal representation learning. The CLIP model is a widely adopted foundational vision language model whose vision encoder has played a critical role in the development of MLLMs such as LLaVA. However, the CLIP vision encoder suffers from notable limitations including being constrained to only handling fixed input resolutions and a failure to produce separated embeddings for dissimilar images. Replacing the vision encoder of an existing model typically incurs substantial computational costs because such a change often necessitates retraining the entire model pipeline. In this work, we identify two factors which underlie the limitations of the CLIP vision encoder: mesoscopic bias and interpolation bias. To address these issues, we…
Peer Reviews
Decision: ICLR 2026 Poster
Excellent Problem Diagnosis: The paper provides a clear and valuable conceptual framework by identifying and naming "mesoscopic bias" and "interpolation bias". This diagnosis of why CLIP fails at high resolutions is a useful contribution to the field. Strong Target-Task Performance: The +13.6% gain on the $V^{*}$ benchmark is extremely significant. This is a challenging benchmark designed to test the exact fine-grained failures of MLLMs, so this result strongly suggests the method is effective
1. The paper's components are not fundamentally new. Quadtrees are a classic data structure, and their application to vision transformers and dynamic tokenization has been explored. Similarly, the token pruning/merging field is already well-established. The idea of replacing static positional embeddings with a dynamic, coordinate-based MLP is also a known technique (e.g., in NeRF). The novelty lies in the combination for this specific MLLM problem, but the components themselves are iterative. 2
Simple, training-free at inference time, and demonstrably effective.
**Rapidly moving baseline.** The community is already shifting from CLIP to newer vision backbones (e.g. InternVL, SigLIP-2) that use RoPE or 2-D absolute + relative encoders and are pre-trained with native multi-resolution recipes. It is unclear whether QLIP retains any advantage when the underlying encoder itself is resolution-robust. A head-to-head comparison with such models is missing. **Limited to CLS-level bias.** QLIP is optimised to deliver a single, high-quality CLS token; it does
1. Practical & Cost-Effective Drop-in Solution: QLIP’s design as a "drop-in replacement" is highly impactful. By significantly enhancing the visual signal without necessitating the re-training or fine-tuning of the entire MLLM pipeline, it offers a practical, low-cost path to upgrading existing MLLMs. 2. Clear Theoretical Motivation: The paper clearly and quantitatively identifies two specific, fundamental biases in the CLIP vision encoder (mesoscopic and interpolation bias). The proposed solut
1. The base model is out of date, and LLaVA's performance is limited. I suggest the authors evaluate on more SOTA models, such as InternVL-3.5 or Qwen2.5-VL, to truly demonstrate the method's effectiveness. 2. I wonder about the performance on visual grounding benchmarks such as RefCOCO, since they also require fine-grained region information.
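One review above describes replacing static positional embeddings with a dynamic, coordinate-based MLP. The sketch below illustrates that general idea only: a tiny, untrained MLP maps the normalized center of a patch to an embedding vector, so patches of any size or position get a positional code without a fixed lookup table. The class `CoordinateMLP`, the helper `patch_center`, the layer sizes, and the example patches are all hypothetical and are not taken from the paper.

```python
import math
import random


class CoordinateMLP:
    """Tiny untrained MLP: normalized (x, y) patch center -> embedding."""

    def __init__(self, hidden=16, dim=8, seed=0):
        rng = random.Random(seed)
        # Randomly initialised weights; a real system would train these.
        self.w1 = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(dim)]
        self.b2 = [0.0] * dim

    def __call__(self, x, y):
        # Hidden layer with tanh, followed by a linear output layer.
        h = [math.tanh(w[0] * x + w[1] * y + b)
             for w, b in zip(self.w1, self.b1)]
        return [sum(wi * hi for wi, hi in zip(w, h)) + b
                for w, b in zip(self.w2, self.b2)]


def patch_center(row, col, size, image_side):
    """Normalized (x, y) center of a square patch, each in [0, 1]."""
    return ((col + size / 2) / image_side, (row + size / 2) / image_side)


mlp = CoordinateMLP()
# Patches of different sizes, as a quadtree might produce (hypothetical values).
patches = [(0, 0, 2), (0, 4, 4), (4, 4, 4)]
embeddings = [mlp(*patch_center(r, c, s, image_side=8)) for r, c, s in patches]
```

Because the input is a continuous coordinate rather than a grid index, the same network can be queried for patches at any resolution, which is the property the reviewer's comparison to NeRF-style encodings points at.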
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training · Attentive Walk-Aggregating Graph Neural Network
