Advancing Complex Video Object Segmentation via Progressive Concept Construction

Zhixiong Zhang; Shuangrui Ding; Xiaoyi Dong; Songxin He; Jianfan Lin; Junsong Tang; Yuhang Zang; Yuhang Cao; Dahua Lin; Jiaqi Wang

arXiv:2507.15852·cs.CV·March 3, 2026

Open Access · 3 Models · 1 Dataset · 3 Reviews

TL;DR

This paper introduces Segment Concept (SeC), a novel VOS framework leveraging high-level object-centric representations and large vision-language models to improve segmentation in complex, concept-rich video scenarios.

Contribution

SeC shifts from feature matching to progressive concept construction using LVLMs, and introduces SeCVOS, a benchmark for high-level reasoning in video object segmentation.

Findings

1. SeC outperforms state-of-the-art methods on SeCVOS and on standard benchmarks.
2. SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS.
3. SeC remains robust in scenarios with appearance variations and scene changes.

Abstract

We propose Segment Concept (SeC), a concept-driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC forwards the LVLMs only when a new scene appears, injecting concept-level features at those points. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene…
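The gating idea in the abstract (run the expensive LVLM only when a new scene appears) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: frames are stood in for by short lists of floats, the scene-change test is a simple mean absolute difference against a hand-picked threshold, the first frame is treated as a scene start, and the names `mean_abs_diff` and `select_concept_frames` are hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between two equal-length frame descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_concept_frames(
    frames: List[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return indices of frames where the expensive model would be invoked.

    Assumption: a scene change is declared when consecutive descriptors
    differ by more than `threshold`. The paper's actual criterion is not
    given in this excerpt.
    """
    if not frames:
        return []
    selected = [0]  # assumption: the first frame always opens a scene
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            selected.append(i)
    return selected


# Toy descriptors: frames 0-1 belong to one scene, frames 2-3 to another.
frames = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
scene_starts = select_concept_frames(frames, threshold=0.5)
print(scene_starts)
```

With these toy inputs, only frames 0 and 2 would trigger the heavy model. The remaining frames would be handled by the cheaper matching path alone.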

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper leverages LVLM-derived (InternVL-2.5) object-level embeddings for the video object segmentation task. It combines low-level visual similarity and high-level semantic similarity to improve segmentation performance on challenging cases.
2. This paper proposes the SeCVOS benchmark, designed for semantically complex scenarios.
3. Keeping only the transition frames in the concept memory bank is both efficient and effective, as it focuses on the most informative moments of semantic change while re

Weaknesses

1. ***Table 5:*** Model parameters and inference time should be included in Table 5 for a clear and fair comparison.
2. In the qualitative analysis, the results of **SAMURAI** and **SAM2-Long** should also be included in Fig. 5 of the main paper and in the figures/videos in the supplementary materials.
3. Some failure cases should be included in Sec. E of the supplementary materials to better illustrate the limitations of the proposed method and provide insights into potential areas for improvement.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The paper introduces a creative method that uses large vision-language models to understand objects by concepts instead of only appearances, showing strong technical quality.
- It is clearly written and well-tested, and the new SeCVOS dataset plus strong results make it important for advancing video object segmentation research.

Weaknesses

When the video is too long, the memory bank fills up rapidly, which increases the computational cost. Currently, it uses a FIFO method to limit the buffer size. It lacks a clear strategy for summarizing or compressing the memory bank. Exploring efficient memory summarization could further enhance scalability for long videos.
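The FIFO policy the reviewer mentions can be sketched as a bounded buffer that evicts its oldest entry once full. This is a minimal illustration and not the authors' implementation: the class name `FifoMemoryBank`, its methods, and the string stand-ins for features are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class FifoMemoryBank:
    """Bounded memory bank: when full, adding an entry evicts the oldest one."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) silently drops the oldest item on overflow.
        self._entries: Deque[Tuple[int, Any]] = deque(maxlen=capacity)

    def add(self, frame_index: int, feature: Any) -> None:
        """Store a (frame index, feature) pair, evicting the oldest if full."""
        self._entries.append((frame_index, feature))

    def frame_indices(self) -> List[int]:
        """Frame indices currently held, oldest first."""
        return [idx for idx, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


# Five frames pushed into a bank of capacity 3: frames 0 and 1 are evicted.
bank = FifoMemoryBank(capacity=3)
for idx in range(5):
    bank.add(idx, f"feat-{idx}")
print(bank.frame_indices())
```

A buffer like this keeps per-frame cost bounded, but it discards old entries regardless of how informative they were, which is the gap a summarization or compression strategy would address.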

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The writing is clear, and the figures are well illustrated.

Weaknesses

1. Novelty is a big issue. Token-level video summarization has been widely exploited for long-term video understanding tasks, such as [1, 2, 3].
   [1] Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method
   [2] InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
   [3] Streaming Long Video Understanding with Large Language Models, NeurIPS
2. Fairness of the selected dataset, the SeCVOS benchmark. This dataset is small with only 160 manually video

Code & Models

Datasets

OpenIXCLab/SeCVOS
dataset · 119 dl

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Multimodal Machine Learning Applications