IC-Custom: Diverse Image Customization via In-Context Learning

Yaowei Li; Xiaoyu Li; Zhaoyang Zhang; Yuxuan Bian; Gan Liu; Xinyuan Li; Jiale Xu; Wenbo Hu; Yating Liu; Lingen Li; Jing Cai; Yuexian Zou; Yancheng He; Ying Shan

arXiv:2507.01926 · cs.CV · October 2, 2025

TL;DR

IC-Custom introduces a unified in-context learning framework for diverse image customization tasks, effectively integrating position-aware and position-free methods with minimal additional training.

Contribution

The paper proposes IC-Custom, a novel framework that combines position-aware and position-free image customization using in-context multi-modal attention and a curated dataset.

Findings

01. Outperforms existing methods in identity consistency and harmony.

02. Achieves 73% higher human preference scores.

03. Requires training only 0.4% of the original model's parameters.
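
To give a feel for how a small trainable fraction can arise, here is a minimal arithmetic sketch based on the LoRA fine-tuning that Reviewer 01 mentions. For a frozen weight matrix of shape `d_out x d_in`, a rank-`r` LoRA adapter adds `r * (d_in + d_out)` trainable parameters. The function name `lora_fraction` and the layer sizes in the usage note are hypothetical and are not taken from the paper; the overall 0.4% figure also depends on which layers are adapted, which this listing does not state.

```python
def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA adapter parameters to the frozen weight's parameters.

    A rank-`rank` adapter factorizes the update as B @ A, with
    A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    frozen = d_in * d_out              # parameters in the original matrix
    trainable = rank * (d_in + d_out)  # parameters in A plus B
    return trainable / frozen


# Hypothetical layer: a 1024 x 1024 projection with a rank-8 adapter.
print(f"{lora_fraction(1024, 1024, 8):.4%}")
```

The ratio shrinks as the layer grows or the rank drops, and it shrinks further when measured against the whole model, because layers without adapters contribute only to the denominator.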

Abstract

Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free paradigms and lack a universal framework for diverse customization, limiting their applicability across scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images into a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We propose the In-context Multi-Modal Attention (ICMA) mechanism, which employs learnable task-oriented register tokens and boundary-aware positional embeddings to…
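
As a rough illustration of the polyptych idea described in the abstract, the following minimal sketch places a reference image next to a target image (a diptych, the two-panel case) and builds a matching mask that marks which pixels should be generated. Everything here is an assumption for illustration: the function names `make_diptych` and `full_mask`, the nested-list image layout, the left-to-right panel order, and the convention that the position-free case masks the entire target panel. None of it is taken from the authors' code.

```python
def make_diptych(reference, target, target_mask):
    """Place a reference image left of a target image and build a joint mask.

    Images are lists of rows; each row is a list of pixels. `target_mask`
    holds one boolean per target pixel, True where content is generated.
    """
    if not (len(reference) == len(target) == len(target_mask)):
        raise ValueError("reference, target, and mask must share one height")
    canvas, mask = [], []
    for ref_row, tgt_row, mask_row in zip(reference, target, target_mask):
        if len(tgt_row) != len(mask_row):
            raise ValueError("mask row width must match target row width")
        canvas.append(list(ref_row) + list(tgt_row))
        # Reference pixels are kept as-is, so their mask entries are False.
        mask.append([False] * len(ref_row) + list(mask_row))
    return canvas, mask


def full_mask(height, width):
    """Assumed position-free case: the whole target panel is generated."""
    return [[True] * width for _ in range(height)]


# Position-aware (assumed): only a user-chosen 2 x 3 region is generated.
reference = [[1] * 4 for _ in range(4)]
target = [[0] * 6 for _ in range(4)]
region = [[1 <= y < 3 and 2 <= x < 5 for x in range(6)] for y in range(4)]
canvas, mask = make_diptych(reference, target, region)
```

In a real pipeline the panels would be image tensors and the mask would be passed to an inpainting-style model; the lists here only show how one layout can express both customization modes by changing the mask.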

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper addresses the long-standing gap of isolated position-aware and position-free image customization by proposing a unified framework.
2. The method balances performance and efficiency: built on pre-trained FLUX.1-Fill, it only trains 0.4% of parameters via LoRA fine-tuning, while outperforming baselines on ProductBench and DreamBench.
3. Overall, the writing is clear.

Weaknesses

1. Limited scenario generalization: IC-Custom is validated only on single-object customization. Its performance in multi-object scenarios (e.g., inserting a backpack and a laptop simultaneously) remains untested.
2. The paper does not report performance in the absence of text input. Position-aware customization relies on text to define identity and scene constraints, so it is unclear how the model behaves without clear text guidance.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

+ Elegant unification: A clear and practical integration of position-aware and position-free customization within one architecture, avoiding the need for separate specialized models.
+ Well-designed ICMA mechanism: Task tokens and boundary embeddings effectively address task ambiguity and spatial confusion, with strong ablations supporting their contribution.
+ Comprehensive evaluation: Extensive comparisons show consistent superiority over GPT-4o and recent open-source baselines, supported by

Weaknesses

+ The “In-context learning” framing is somewhat overstated: the model is trained end-to-end rather than showing adaptive few-shot behavior.
+ There is no explicit modeling of 3-D or geometric consistency, and robustness tests for large viewpoint changes are limited.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Novel unification of two customization paradigms using in-context learning on DiT, with innovative ICMA incorporating task-specific register tokens and positional embeddings to handle ambiguity.
2. Robust dataset curation (12K samples, real+synthetic) and comprehensive evaluations, including ablations that validate key components.
3. Intuitive diptych formulation and clear training strategy enable seamless handling of diverse tasks without separate models.
4. Advances efficient, identity-c

Weaknesses

1. No quantitative analysis of inference time, and no analysis of failure cases beyond visual examples. The paper does not address how long the model takes to generate results, nor does it compare speed against baseline models.
2. Unclear handling of multi-reference customization. The paper briefly mentions multi-reference customization as future work, but does not provide sufficient detail on how it would be incorporated into the current framework.
3. Limited Exploration of Dataset Diversity. The

Code & Models

Models

TencentARC/IC-Custom
model · 15 dl · ♡ 16

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications