TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li; Jianjiang Yang; Tian Yun; Pinyuan Feng; Jinfa Huang; Ruixiang Tang

arXiv:2505.17098·cs.CL·October 22, 2025

TL;DR

TACO introduces a task mapping-guided approach to dynamically configure multimodal in-context learning sequences, significantly improving model reasoning and performance across diverse vision-language tasks.

Contribution

The paper proposes a task-aware attention mechanism, implemented in a lightweight transformer, that improves sequence configuration in multimodal ICL and links interpretability to performance.

Findings

01. TACO outperforms baselines on multiple LVLMs and datasets.

02. Task mapping improves understanding and effectiveness of ICL sequences.

03. Dynamic sequence configuration enhances reasoning in multimodal tasks.

Abstract

Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy…
