STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Haiyi Qiu; Minghe Gao; Long Qian; Kaihang Pan; Qifan Yu; Juncheng Li; Wenjie Wang; Siliang Tang; Yueting Zhuang; Tat-Seng Chua

arXiv:2412.00161 · cs.CV · April 1, 2025

Open Access

TL;DR

This paper introduces STEP, a graph-guided self-training method in which a Video-LLM generates reasoning-rich training data from raw videos and fine-tunes on it, improving its multi-step spatio-temporal (compositional) reasoning.

Contribution

The paper presents a novel approach that automatically generates reasoning-focused training data for Video-LLMs from spatio-temporal scene graphs and chain-of-thought rationales, reducing manual effort and alleviating the scarcity of such data.

Findings

01. 21.3% improvement on multi-step reasoning tasks

02. Effective with minimal self-generated training data

03. Enhances compositional reasoning and understanding

Abstract

Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks, such as captioning and coarse-grained question answering, but struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events. The hurdles to enhancing this capability include extensive manual labor, the lack of spatio-temporal compositionality in existing data, and the absence of explicit reasoning supervision. In this paper, we propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves. Specifically, we first induce Spatio-Temporal Scene Graph (STSG) representations of diverse videos to capture fine-grained, multi-granular video semantics. Then, the STSGs guide the derivation of multi-step…
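To make the idea of a spatio-temporal scene graph concrete, the sketch below stores per-frame (subject, predicate, object) triples and answers a two-step "what happened after X?" question by traversing them. It is only an illustration under stated assumptions: the names `Relation`, `SceneGraph`, `build_demo_graph`, and `what_happens_after`, the triple-per-frame layout, and the demo data are all hypothetical and are not taken from the paper, whose actual STSG construction and question derivation are not described in this excerpt.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) triple observed in a single frame."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph: relation triples tagged with a frame index.

    Hypothetical structure for illustration only, not the paper's STSG format.
    """
    relations: List[Relation] = field(default_factory=list)

    def add(self, frame: int, subject: str, predicate: str, obj: str) -> None:
        self.relations.append(Relation(frame, subject, predicate, obj))

    def first_frame(self, subject: str, predicate: str, obj: str) -> Optional[int]:
        """Earliest frame in which the given triple is observed, or None."""
        frames = [r.frame for r in self.relations
                  if (r.subject, r.predicate, r.obj) == (subject, predicate, obj)]
        return min(frames) if frames else None

    def after(self, frame: int, subject: str) -> List[Relation]:
        """Relations involving `subject` in strictly later frames, in time order."""
        later = (r for r in self.relations
                 if r.subject == subject and r.frame > frame)
        return sorted(later, key=lambda r: r.frame)


def build_demo_graph() -> SceneGraph:
    """Made-up three-frame example."""
    g = SceneGraph()
    g.add(0, "person", "picks_up", "cup")
    g.add(1, "person", "walks_to", "fridge")
    g.add(2, "person", "opens", "fridge")
    return g


def what_happens_after(g: SceneGraph, subject: str, predicate: str,
                       obj: str) -> Optional[Relation]:
    """Two-step query: (1) locate the anchor event, (2) return the next event."""
    start = g.first_frame(subject, predicate, obj)
    if start is None:
        return None  # anchor event never observed
    later = g.after(start, subject)
    return later[0] if later else None


if __name__ == "__main__":
    graph = build_demo_graph()
    print(what_happens_after(graph, "person", "picks_up", "cup"))
```

Each step of the query maps to one hop over the graph (first a spatial relation lookup, then a temporal ordering), which is the kind of multi-step structure the abstract refers to when it speaks of inference across object relations, interactions, and events.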

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications