FocalClick-XL: Towards Unified and High-quality Interactive Segmentation
Xi Chen, Hengshuang Zhao

TL;DR
FocalClick-XL introduces a multi-stage, large-scale pretraining approach for interactive segmentation, enabling support for diverse interaction types and fine-grained mask predictions, achieving state-of-the-art results.
Contribution
It extends the classical FocalClick design with a novel pipeline that decomposes segmentation into meta-tasks, each pretrained independently, enhancing flexibility and performance.
Findings
State-of-the-art on click-based benchmarks
Supports diverse interaction formats including boxes and scribbles
Capable of predicting detailed alpha mattes
Abstract
Interactive segmentation enables users to extract binary masks of target objects through simple interactions such as clicks, scribbles, and boxes. However, existing methods often support only limited interaction forms and struggle to capture fine details. In this paper, we revisit the classical coarse-to-fine design of FocalClick and introduce significant extensions. Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL, to address these challenges simultaneously. Following the emerging trend of large-scale pretraining, we decompose interactive segmentation into meta-tasks that capture different levels of information -- context, object, and detail -- assigning a dedicated subnet to each level. This decomposition allows each subnet to undergo scaled pretraining with independent data and supervision, maximizing its effectiveness. To enhance flexibility, we share…
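The three-level decomposition described in the abstract (context, object, detail) can be pictured as a chain of stages, each consuming the previous stage's output. The following is only a toy sketch of that data flow and not the authors' implementation: every name (`Click`, `context_stage`, `object_stage`, `detail_stage`, `segment`) is hypothetical, the hand-written rules stand in for the pretrained subnets, and only click input is modeled, not boxes or scribbles.

```python
from dataclasses import dataclass
from typing import List, Sequence

Mask = List[List[float]]  # row-major grid of values in [0, 1]


@dataclass(frozen=True)
class Click:
    """Hypothetical user click; positive marks the target, negative excludes."""
    row: int
    col: int
    positive: bool = True


def _near(r: int, c: int, click: Click, radius: int) -> bool:
    # Chebyshev distance: true inside a square window centered on the click.
    return max(abs(r - click.row), abs(c - click.col)) <= radius


def context_stage(height: int, width: int, clicks: Sequence[Click],
                  radius: int = 1) -> Mask:
    """Placeholder for a context-level subnet: coarse localization.

    Assumption: marks a square window around every positive click.
    """
    pos = [k for k in clicks if k.positive]
    return [[1.0 if any(_near(r, c, k, radius) for k in pos) else 0.0
             for c in range(width)] for r in range(height)]


def object_stage(coarse: Mask, clicks: Sequence[Click], radius: int = 0) -> Mask:
    """Placeholder for an object-level subnet: refine the coarse mask.

    Assumption: clears pixels within `radius` of any negative click.
    """
    neg = [k for k in clicks if not k.positive]
    return [[0.0 if any(_near(r, c, k, radius) for k in neg) else v
             for c, v in enumerate(row)] for r, row in enumerate(coarse)]


def detail_stage(mask: Mask) -> Mask:
    """Placeholder for a detail-level subnet: binary mask to soft alpha.

    Assumption: a 3x3 box average over in-bounds neighbors softens edges.
    """
    h, w = len(mask), len(mask[0])
    alpha: Mask = []
    for r in range(h):
        out_row = []
        for c in range(w):
            vals = [mask[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out_row.append(sum(vals) / len(vals))
        alpha.append(out_row)
    return alpha


def segment(height: int, width: int, clicks: Sequence[Click]) -> Mask:
    """Chain the three placeholder stages: context -> object -> detail."""
    coarse = context_stage(height, width, clicks)
    refined = object_stage(coarse, clicks)
    return detail_stage(refined)
```

Calling `segment(5, 5, [Click(2, 2)])` returns a 5x5 grid of soft values that peak at the clicked pixel and fall off toward the border. Because each stage only depends on the output type of the one before it, any stage could in principle be swapped or trained separately, which is the property the abstract attributes to the decomposition.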
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
