TL;DR
This paper introduces Live Interactive Training (LIT), a framework enabling video segmentation models to learn online from human corrections, significantly reducing user effort and improving performance in challenging scenarios.
Contribution
The authors propose LIT-LoRA, a lightweight method that trains a LoRA module on the fly from user corrections at inference time. It reduces the number of corrections a user must provide and extends to other segmentation and classification tasks.
Findings
Achieves an 18-34% reduction in total corrections on benchmarks.
Training overhead is minimal, around 0.5 seconds per correction.
Demonstrates generality by adapting to other models and tasks.
Abstract
Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage). Yet even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an…
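The update loop described in the abstract (freeze the base model, then train a small low-rank adapter on each correction) can be illustrated with a toy sketch. This is not the authors' implementation: the linear "mask head", the squared-error loss, the deterministic initialisation, and every name below (LoRALinear, train_on_correction, frame_feat, corrected) are assumptions chosen so the example runs with the Python standard library alone.

```python
# Minimal sketch of a LoRA-style online update (NOT the paper's implementation).
# Hypothetical setup: a frozen linear head W maps a feature vector to mask
# logits, and a user correction supplies the target logits for one frame.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    def __init__(self, W, rank=1, a_init=0.5):
        self.W = W  # frozen base weights, never updated
        d_out, d_in = len(W), len(W[0])
        # A: rank x d_in. Deterministic init for reproducibility (LoRA
        # normally initialises A randomly). Assumes rank <= d_in.
        self.A = [[a_init if j == k else 0.0 for j in range(d_in)]
                  for k in range(rank)]
        # B: d_out x rank, zero init so the adapter starts as a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)      # project down to `rank` dimensions
        delta = matvec(self.B, h)  # project back up to the output size
        return [b + d for b, d in zip(matvec(self.W, x), delta)]

    def train_on_correction(self, x, target, steps=100, lr=0.1):
        """Gradient steps on 0.5 * ||forward(x) - target||^2, A and B only."""
        for _ in range(steps):
            h = matvec(self.A, x)
            err = [y - t for y, t in zip(self.forward(x), target)]
            # dL/dB = outer(err, h); dL/dA = outer(B^T err, x)
            bt_err = [sum(self.B[i][k] * err[i] for i in range(len(err)))
                      for k in range(len(h))]
            for i in range(len(self.B)):
                for k in range(len(h)):
                    self.B[i][k] -= lr * err[i] * h[k]
            for k in range(len(self.A)):
                for j in range(len(x)):
                    self.A[k][j] -= lr * bt_err[k] * x[j]


frozen_W = [[1.0, 0.0]]
model = LoRALinear([row[:] for row in frozen_W], rank=1)
frame_feat, corrected = [1.0, 1.0], [0.0]  # hypothetical feature and correction
before = model.forward(frame_feat)[0]
model.train_on_correction(frame_feat, corrected)
after = model.forward(frame_feat)[0]
next_frame = model.forward([1.0, 0.8])[0]  # a similar, later frame
```

In this toy setting the prediction on the corrected frame moves to the user's target, and a similar later frame also shifts toward it even though it was never corrected, while the base weights stay untouched. A real system would presumably attach the adapter to layers of the segmentation model and use a mask loss; those details are not given in the text above.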
