STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, Tat-Seng Chua

TL;DR
This paper introduces STEP, a graph-guided self-training method in which a Video-LLM generates reasoning-rich training data from raw videos and is fine-tuned on it, significantly improving its multi-step spatio-temporal (compositional) reasoning.
Contribution
The paper presents a novel approach that automatically generates reasoning-focused training data for Video-LLMs using spatio-temporal scene graphs and chain-of-thought rationales, reducing manual annotation effort and mitigating data scarcity.
Findings
21.3% improvement in multi-step reasoning tasks
Effective with minimal self-generated training data
Enhances compositional reasoning and understanding
Abstract
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw video and use it to improve themselves. Specifically, we first induce a Spatio-Temporal Scene Graph (STSG) representation of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSGs guide the derivation of multi-step…
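The abstract is cut off before it describes how the STSGs are built or queried, so the following is only a toy illustration of the general idea, not the authors' implementation. It assumes a scene graph stored as (frame, subject, predicate, object) edges and shows how a two-step question with a step-by-step rationale could be derived from it. The names `Relation`, `SceneGraph`, and `two_step_question` are hypothetical and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) edge observed in a single frame."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph: a flat list of per-frame relation edges."""
    relations: list[Relation] = field(default_factory=list)

    def add(self, frame: int, subject: str, predicate: str, obj: str) -> None:
        self.relations.append(Relation(frame, subject, predicate, obj))

    def first_frame(self, subject: str, predicate: str, obj: str) -> int | None:
        """Earliest frame in which the given relation holds, or None if absent."""
        frames = [
            r.frame
            for r in self.relations
            if (r.subject, r.predicate, r.obj) == (subject, predicate, obj)
        ]
        return min(frames) if frames else None

    def after(self, frame: int, subject: str) -> list[Relation]:
        """Relations of `subject` strictly after `frame`, in temporal order."""
        later = [r for r in self.relations if r.subject == subject and r.frame > frame]
        return sorted(later, key=lambda r: r.frame)


def two_step_question(graph: SceneGraph, subject: str, predicate: str, obj: str) -> dict | None:
    """Compose a two-hop QA pair: find an anchor event, then ask what follows it."""
    anchor = graph.first_frame(subject, predicate, obj)
    if anchor is None:
        return None  # anchor event never occurs in this graph
    later = graph.after(anchor, subject)
    if not later:
        return None  # nothing happens after the anchor event
    nxt = later[0]
    return {
        "question": f"What does the {subject} do after it {predicate} the {obj}?",
        "rationale": [
            f"Step 1: the {subject} {predicate} the {obj} at frame {anchor}.",
            f"Step 2: the next relation for the {subject} is "
            f"'{nxt.predicate} {nxt.obj}' at frame {nxt.frame}.",
        ],
        "answer": f"{nxt.predicate} {nxt.obj}",
    }


# Hypothetical example video: a person handling a cup.
graph = SceneGraph()
graph.add(3, "person", "picks up", "cup")
graph.add(7, "person", "drinks from", "cup")
graph.add(12, "person", "puts down", "cup")
qa = two_step_question(graph, "person", "picks up", "cup")
```

In this sketch the rationale is assembled from the same graph lookups that produce the answer, which is one simple way to keep explicit reasoning supervision consistent with the generated label.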
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
