Dense Video Object Captioning from Disjoint Supervision
Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

TL;DR
This paper introduces a unified end-to-end model for dense video object captioning that detects, tracks, and describes objects in videos, leveraging disjoint supervision from large datasets to improve accuracy and zero-shot capabilities.
Contribution
The paper presents a novel unified model and training strategy for dense video object captioning, enabling better temporal coherence and zero-shot performance compared to multi-stage pipelines.
Findings
Outperforms a multi-stage pipeline of state-of-the-art detection, tracking, and captioning models on dense video object captioning.
Achieves state-of-the-art results on spatial grounding tasks.
Leverages disjoint datasets for effective training and zero-shot generalization.
Abstract
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language. We propose a unified model, and demonstrate how our end-to-end approach is more accurate and temporally coherent than a multi-stage pipeline combining state-of-the-art detection, tracking, and captioning models. Moreover, we propose a training strategy based on a mixture of disjoint tasks, which allows us to leverage diverse, large-scale datasets which supervise different parts of our model. Although each pretraining task only provides weak supervision, they are complementary and, when combined, result in noteworthy zero-shot ability and serve as strong initialization for additional…
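The idea of training on a mixture of disjoint tasks can be illustrated with a small sketch: each batch comes from one dataset, and only the heads that dataset has labels for contribute to the loss. This is not the authors' implementation. The task names, the head names, and the mapping between them are hypothetical and chosen only for illustration; the abstract states only that different datasets supervise different parts of the model.

```python
from typing import Dict, FrozenSet

# Hypothetical mapping (an assumption, not taken from the paper):
# which model heads each pretraining task is able to supervise.
TASK_SUPERVISION: Dict[str, FrozenSet[str]] = {
    "image_detection": frozenset({"detection"}),
    "image_object_captioning": frozenset({"detection", "captioning"}),
    "video_captioning": frozenset({"captioning"}),
    "video_tracking": frozenset({"detection", "tracking"}),
}


def masked_total_loss(task: str, head_losses: Dict[str, float]) -> float:
    """Sum only the per-head losses that the batch's source task has labels for.

    Heads without labels for this task are skipped, so they receive no
    gradient signal from this batch.
    """
    supervised = TASK_SUPERVISION[task]
    return sum(loss for head, loss in head_losses.items() if head in supervised)


def heads_covered(tasks) -> FrozenSet[str]:
    """Return the union of heads supervised by a mixture of tasks."""
    covered: FrozenSet[str] = frozenset()
    for task in tasks:
        covered = covered | TASK_SUPERVISION[task]
    return covered


if __name__ == "__main__":
    # Per-head losses for one batch (made-up numbers for the example).
    losses = {"detection": 1.0, "tracking": 2.0, "captioning": 0.5}
    for task in TASK_SUPERVISION:
        print(task, masked_total_loss(task, losses))
    print(sorted(heads_covered(TASK_SUPERVISION)))
```

No single task in this sketch supervises all three heads, but the mixture covers them all, which is the sense in which weakly supervised tasks can be complementary.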
Peer Reviews
Decision: ICLR 2025 Spotlight
1. **Clear and Precise Definition of a New Benchmark**: The paper introduces a novel task aimed at achieving more comprehensive video understanding, thereby setting higher standards for a single model's capability to interpret videos. It also establishes well-defined evaluation metrics that are appropriate for this new benchmark.
2. **Design of a Concise and Effective Training Framework**: In response to the proposed task, the paper presents a streamlined, end-to-end trainable framework that ef
1. **Clarification of Task Significance**: Although the task imposes higher demands on deep models for video understanding, the paper does not clearly articulate the associated benefits. It would be advantageous to illustrate the task's relevance in more challenging or representative application scenarios, such as sports commentary or intelligent animal monitoring. This would better underscore its significance and research value.
2. **Examination of Method Generalizability**: The authors succes
This paper introduces a novel task—dense video object captioning with unified spatial and temporal localization—bridging video understanding and natural language description in a unique way. The end-to-end model presented is robust, outperforming multi-stage pipelines by integrating detection, tracking, and captioning into a cohesive approach that achieves notable zero-shot capabilities. The authors clearly outline the model architecture, training strategy, and metrics, making complex componen
The paper primarily focuses on a few datasets, which may not fully represent the diversity of real-world scenarios. Expanding evaluations across a broader range of benchmarks could strengthen the validity of the results. While quantitative metrics are important, including qualitative evaluations—such as human assessments of captioning quality—could enrich the understanding of model performance and highlight potential areas for improvement. The paper does not provide an in-depth analysis of fai
This work tackles an important problem that was missing from the literature. The paper is well-written and easy to follow. Extensive experiments have been performed.
My only concern is that the end-to-end tracking algorithm (listed as a contribution) seems naive and not novel enough. Other methods also perform identity association within the model (such as MinVIS [1], CAROQ [2], and TrackFormer [3]). The authors have explored some other ways of integrating temporal information/tracking in Table 2, but what about different ways of association within the model (e.g., query vector propagation in TrackFormer [3])? [1] MinVIS: A Minimal Video Instance Se
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
