From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Carlos Schmidt; Simon Reiß

arXiv:2604.06748·cs.CV·April 9, 2026

TL;DR

This paper introduces a method to transform static visual in-context learners into interactive, user-guided systems, enabling real-time control of visual tasks through cues such as scribbles, clicks, or bounding boxes, without fine-tuning.

Contribution

The paper proposes a simple approach to encode user interactions into visual in-context learning models, significantly enhancing their controllability and practical usability in real-world scenarios.
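This summary does not spell out how the interactions are encoded. As a purely illustrative sketch, and not the authors' implementation, one simple assumption is that a click or bounding box is painted directly onto the query image before it is passed to the in-context learner. The helper names `overlay_click` and `overlay_box`, the marker colors, and the list-of-rows image representation are all hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a real image array


def overlay_click(image: Image, row: int, col: int,
                  radius: int = 1, color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `image` with a filled disc painted at (row, col).

    Hypothetical encoding: the click becomes visible pixels in the query image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(r) for r in image]  # copy rows so the input stays untouched
    for r in range(max(0, row - radius), min(height, row + radius + 1)):
        for c in range(max(0, col - radius), min(width, col + radius + 1)):
            if (r - row) ** 2 + (c - col) ** 2 <= radius ** 2:
                out[r][c] = color
    return out


def overlay_box(image: Image, top: int, left: int, bottom: int, right: int,
                color: Pixel = (0, 255, 0)) -> Image:
    """Return a copy of `image` with a one-pixel box outline (inclusive corners)."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(r) for r in image]
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            on_edge = r in (top, bottom) or c in (left, right)
            if on_edge and 0 <= r < height and 0 <= c < width:
                out[r][c] = color
    return out


if __name__ == "__main__":
    blank = [[(0, 0, 0)] * 5 for _ in range(5)]
    marked = overlay_box(overlay_click(blank, 2, 2), 0, 0, 4, 4)
    for line in marked:
        print("".join("x" if px != (0, 0, 0) else "." for px in line))
```

Painting the cue into pixel space keeps the model interface unchanged, which is one reason such an encoding is attractive for a learner that only consumes images; whether the paper does this is not stated here.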

Findings

01. Outperforms state-of-the-art models in leveraging user interactions.

02. Improves IoU by 7.95% in interactive segmentation.

03. Improves PSNR by 2.46 in directed super-resolution.
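For readers unfamiliar with the two metrics cited in the findings, the following minimal sketch shows their standard textbook definitions. It is not the paper's evaluation code: the function names, the nested-list inputs, and the default peak value of 255 are assumptions made for illustration, and the summary does not say whether the IoU gain is absolute or relative.

```python
import math
from typing import List, Sequence


def iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treating that as 1.0 is a convention.
    return inter / union if union else 1.0


def psnr(pred: List[List[float]], target: List[List[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels for two same-sized images."""
    sq_err = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            sq_err += (p - t) ** 2
            count += 1
    mse = sq_err / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
    print(psnr([[10.0, 20.0]], [[12.0, 18.0]]))
```

Higher is better for both metrics: IoU is bounded by 1, while PSNR grows without bound as the mean squared error approaches zero.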

Abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly…
