Efficient Multi-modal Long Context Learning for Training-free Adaptation
Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian

TL;DR
This paper introduces EMLoC, a training-free method for multi-modal long context learning that compresses and prunes inputs to enable efficient, scalable, and high-performance task adaptation without fine-tuning.
Contribution
EMLoC is the first to combine compression and adaptive pruning for multi-modal long-context learning, enabling efficient, training-free model adaptation.
Findings
Achieves comparable or superior performance to naive long-context methods.
Significantly reduces inference complexity through token pruning.
Demonstrates effectiveness across diverse vision-language benchmarks.
Abstract
Traditional approaches to adapting multi-modal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly…
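The abstract mentions pruning tokens at each layer under a Jensen-Shannon (JS) divergence constraint. As a rough illustration of that idea only, the sketch below computes the JS divergence between two output distributions and picks the smallest token keep-ratio whose pruned output stays within a divergence budget. The function names, the `delta` budget, the candidate `ratios`, and the top-k importance rule are all assumptions made for this example; the page does not describe how EMLoC scores tokens or searches ratios.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def prune_by_importance(scores, keep_ratio):
    """Hypothetical rule: keep the top `keep_ratio` fraction of tokens by score.

    Returns the kept token indices in their original order.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def smallest_ratio_within_budget(full_probs, pruned_probs_fn, ratios, delta):
    """Return the smallest keep-ratio whose output stays within JS <= delta.

    `pruned_probs_fn(ratio)` is a caller-supplied (hypothetical) function that
    returns the model's output distribution when only that fraction of context
    tokens is kept at the current layer. Falls back to 1.0 (no pruning).
    """
    for ratio in sorted(ratios):
        if js_divergence(full_probs, pruned_probs_fn(ratio)) <= delta:
            return ratio
    return 1.0
```

In this sketch a tighter `delta` forces a larger keep-ratio (less pruning), while a looser one allows more tokens to be dropped; running the search once per layer would give each layer its own ratio.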
Taxonomy
Topics: Image Enhancement Techniques · Multimodal Machine Learning Applications
Methods: Pruning
