Efficient Multi-modal Long Context Learning for Training-free Adaptation

Zehong Ma; Shiliang Zhang; Longhui Wei; Qi Tian

arXiv:2505.19812 · cs.CV · May 27, 2025

TL;DR

This paper introduces EMLoC, a training-free method for multi-modal long context learning that compresses and prunes inputs to enable efficient, scalable, and high-performance task adaptation without fine-tuning.

Contribution

EMLoC is the first to combine compression and adaptive pruning for multi-modal long-context learning, enabling efficient, training-free model adaptation.

Findings

01 Achieves comparable or superior performance to naive long-context methods.

02 Significantly reduces inference complexity through token pruning.

03 Demonstrates effectiveness across diverse vision-language benchmarks.

Abstract

Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly…
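The pruning criterion mentioned in the abstract can be illustrated with a small sketch. The Jensen-Shannon divergence between two distributions P and Q is the average of KL(P || M) and KL(Q || M), where M = (P + Q) / 2. The code below computes it in plain Python and then uses it as a budget in a greedy loop that drops low-scoring tokens as long as the output distribution stays close to the full-context output. This is an illustrative assumption, not the procedure from the paper: the names `prune_under_js_budget`, `token_scores`, `predict`, and `budget`, and the toy model, are hypothetical, and the actual method applies pruning layer by layer inside the MLLM.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prune_under_js_budget(token_scores, predict, budget):
    """Hypothetical greedy pruning: try to drop tokens from least to most
    important, keeping a removal only if the output distribution stays
    within `budget` JS divergence of the full-context output."""
    keep = list(range(len(token_scores)))
    reference = predict(keep)  # output distribution with all tokens kept
    for idx in sorted(keep, key=lambda i: token_scores[i]):
        candidate = [i for i in keep if i != idx]
        if not candidate:
            break  # never prune the context down to nothing
        if js_divergence(reference, predict(candidate)) <= budget:
            keep = candidate
    return keep

# Toy stand-in for a model: each context token adds to two output logits.
contributions = [[2.0, 0.0], [0.0, 0.0], [0.01, 0.0], [0.0, 1.0]]
scores = [2.0, 0.0, 0.01, 1.0]  # assumed per-token importance scores

def toy_predict(kept):
    logits = [sum(contributions[i][c] for i in kept) for c in range(2)]
    return softmax(logits)

kept = prune_under_js_budget(scores, toy_predict, budget=1e-3)
print(kept)  # tokens 1 and 2 barely affect the output, so [0, 3] remain
```

A tighter budget keeps more tokens and a looser one prunes more aggressively, which is the trade-off between inference cost and fidelity that the abstract describes.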

Code & Models

Repositories

zehong-ma/emloc (jax · Official)

Taxonomy

Topics: Image Enhancement Techniques · Multimodal Machine Learning Applications

Methods: Pruning