From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks
Carlos Schmidt, Simon Reiß

TL;DR
This paper introduces a method to transform static visual in-context learners into interactive, user-guided systems, enabling real-time visual task control through natural cues without fine-tuning.
Contribution
It proposes a simple approach to encode user interactions into in-context learning models, significantly enhancing their controllability and practical usability in real-world scenarios.
Findings
Outperforms state-of-the-art models in leveraging user interactions.
Achieves +7.95% IoU in interactive segmentation.
Improves PSNR by 2.46 dB in directed super-resolution.
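For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of their standard textbook definitions. It is not the paper's evaluation code: the function names, the nested-list mask and image format, and the default `max_value` of 255 are assumptions made for illustration, and the summary does not state the exact evaluation protocol.

```python
import math


def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1).

    Assumption: both masks have the same shape. Two empty masks are
    treated as a perfect match (IoU = 1.0), which is one common convention.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return intersection / union if union else 1.0


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images.

    Assumption: images are nested lists of intensities in [0, max_value].
    Identical images have zero error, so the result is infinity.
    """
    squared_error = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    predicted_mask = [[1, 1], [0, 0]]
    reference_mask = [[1, 0], [0, 0]]
    print(iou(predicted_mask, reference_mask))

    upscaled = [[100, 110], [120, 130]]
    reference = [[102, 108], [121, 129]]
    print(psnr(upscaled, reference))
```

Higher is better for both metrics: IoU ranges from 0 to 1 (often quoted as a percentage), and PSNR grows as the mean squared error between prediction and reference shrinks.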
Abstract
Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly…
