CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Donghee Lee; Rui Cai; Zhe Zhao

arXiv:2601.13622·cs.CV·March 30, 2026

TL;DR

CARPE is a lightweight framework that improves large vision-language models by adaptively balancing visual and textual representations, enhancing performance on image classification and vision-language tasks.

Contribution

The paper introduces an ensemble-based method that dynamically prioritizes image features within LVLMs, addressing the loss of vision-centric capability that aligning visual representations to the language space can cause.

Findings

01. CARPE improves image classification accuracy.

02. It enhances performance on diverse vision-language benchmarks.

03. Modality balancing is crucial for multimodal generalization.

Abstract

Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image…
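The abstract does not give the ensemble formula, so the following is only a minimal sketch of what a context-dependent ensemble of two predictors could look like, not the authors' method. It assumes a scalar gate computed from a context vector, which forms a convex combination of class probabilities from a vision-encoder head and an LLM head. All names here (`gated_ensemble`, `gate_weights`, `gate_bias`, `context`) are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_ensemble(vision_logits, llm_logits, context, gate_weights, gate_bias=0.0):
    """Blend two sets of class predictions with a context-dependent gate.

    ASSUMPTION: a single scalar gate alpha = sigmoid(w . context + b) weights
    the vision-encoder probabilities against the LLM probabilities. The paper's
    actual mechanism may differ; this is an illustrative stand-in.
    """
    if len(vision_logits) != len(llm_logits):
        raise ValueError("both heads must score the same set of classes")
    if len(context) != len(gate_weights):
        raise ValueError("context and gate_weights must have equal length")

    # Gate in (0, 1): higher values give more weight to the vision head.
    score = sum(w * c for w, c in zip(gate_weights, context)) + gate_bias
    alpha = sigmoid(score)

    p_vision = softmax(vision_logits)
    p_llm = softmax(llm_logits)

    # Convex combination, so the result is still a probability distribution.
    probs = [alpha * pv + (1.0 - alpha) * pl for pv, pl in zip(p_vision, p_llm)]
    return probs, alpha


if __name__ == "__main__":
    # Hypothetical inputs: three classes and a two-dimensional context vector.
    probs, alpha = gated_ensemble(
        vision_logits=[2.0, 0.5, -1.0],
        llm_logits=[0.1, 1.5, 0.3],
        context=[0.8, -0.2],
        gate_weights=[1.0, 0.5],
    )
    print("gate:", alpha)
    print("ensemble probabilities:", probs)
```

Mixing probabilities rather than logits keeps the output a valid distribution for any gate value, and pushing the gate toward 1 recovers the vision head's prediction unchanged. In this sketch that is what prioritizing the image representation amounts to.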
