PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment
Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia

TL;DR
PromptHub introduces a locality-aware fusion framework for visual in-context learning, leveraging spatial priors and multiple objectives to improve performance, robustness, and transferability across vision tasks and scenarios.
Contribution
It presents a novel locality-aware paradigm for multi-prompt fusion in VICL, surpassing patch-wise methods with holistic spatial priors and guided training objectives.
Findings
Outperforms existing methods on three vision tasks
Demonstrates strong transferability and robustness in OOD settings
Shows universality across various retrieval scenarios
Abstract
Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of…
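As a rough illustration of what "locality-aware fusion" could mean in contrast to patch-wise fusion, the sketch below lets each query patch attend to prompt patches inside a small spatial window around its own grid coordinate, across all prompts at once. This is only a sketch under stated assumptions, not the authors' implementation: the function names `local_window_mask` and `fuse_prompts`, the `radius` parameter, the square-window (Chebyshev) neighbourhood, and the plain dot-product attention are all hypothetical choices made for illustration.

```python
import numpy as np


def local_window_mask(h, w, radius):
    """Boolean (h*w, h*w) mask; entry [q, k] is True when patch k lies
    within `radius` grid steps (Chebyshev distance) of patch q.
    Hypothetical helper, not taken from the paper."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)        # (h*w, 2)
    diff = np.abs(coords[:, None, :] - coords[None, :, :])     # (h*w, h*w, 2)
    return diff.max(axis=-1) <= radius


def fuse_prompts(query, prompts, h, w, radius=1):
    """Fuse n prompts into one using windowed cross-attention.

    query:   (h*w, d) patch features of the query image
    prompts: (n, h*w, d) patch features of n demonstration prompts
    returns: (h*w, d) fused prompt features

    Assumption: radius=0 models a strictly position-wise ("patch-wise")
    fusion; radius>0 adds a local spatial neighbourhood.
    """
    n, p, d = prompts.shape
    mask = local_window_mask(h, w, radius)                     # (p, p)
    # Similarity of every query patch to every patch of every prompt.
    scores = np.einsum("qd,nkd->qnk", query, prompts) / np.sqrt(d)
    # Block prompt patches that fall outside the local window.
    scores = np.where(mask[:, None, :], scores, -np.inf)
    scores = scores.reshape(p, n * p)
    # Numerically stable softmax over all allowed prompt patches.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ prompts.reshape(n * p, d)


rng = np.random.default_rng(0)
query = rng.normal(size=(16, 8))        # 4x4 grid of 8-dim patch features
prompts = rng.normal(size=(3, 16, 8))   # three demonstration prompts
fused = fuse_prompts(query, prompts, h=4, w=4, radius=1)
```

Here `fused` has shape `(16, 8)`, one fused feature per query patch. Increasing `radius` widens the context each patch can draw on; the paper's actual windowing scheme and attention design may differ.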
Peer Reviews
Decision: ICLR 2026 Poster
- The concept of utilizing local-aware aggregation for prompt fusion and the design of learning objectives is illustrative and aligned with the original motivation.
- The performance improvement is significant compared to competitors.
- The writing is generally well-structured and easy to understand.
- **Interpretability**: As shown in Figure 6, the fused prompts of Condenser align closely with the target regions of the queries. In contrast, the generated object contours in PromptHub are misaligned with the queries. To improve clarity, it would be beneficial to include additional visual results, such as attention maps of the prompt regions similar to those in Figure 13.
- The cross-attention operation is constrained by the coordinate position. What would happen if the prompt pairs have dist
1) Generating optimal prompts in VICL is a hot topic in the community, and the results of PromptHub show that the method brings improvements. 2) Three new learning objectives are developed to train the fusion process.
1) One of the main concerns lies in the novelty of the work. Given that the previous CONDENSER model also generates fused prompts from the query image and N pairs, I find that the locality-aware attention and training losses developed in PromptHub may limit the novelty of this paper. I hope the authors can clarify the core differences between the previous work and PromptHub in terms of the main motivation, fusion strategy, and training process. 2) Table 3 presents ablation results demonstrat
1. The overall idea is well-motivated, as the paper clearly identifies the limitation of patch-wise fusion in existing multi-prompt VICL methods and introduces a locality-aware fusion mechanism that intuitively enhances spatial coherence and prompt utilization. 2. Extensive experiments show consistent gains across segmentation, detection, and colorization.
1. The claim that PromptHub ‘establishes an interpretable paradigm’ is somewhat overstated. Moreover, the analysis of interpretability remains superficial, lacking quantitative evidence or deeper reasoning about what the fused representations capture. 2. The visualization results are not entirely convincing. For instance, in some examples (e.g., prompt 1), background regions such as the sky or ground areas from unrelated prompts (e.g., prompt 2) are also highlighted, suggesting that the fusion
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Visual Attention and Saliency Detection
