Top-down Activity Representation Learning for Video Question Answering

Yanan Wang, Shuichiro Haruta, Donghuo Zeng, Julio Vizcarra, and Mori Kurokawa

arXiv:2409.07748·cs.CV·September 16, 2024

Open Access

TL;DR

This paper introduces a novel top-down activity representation learning method for VideoQA, converting long-term videos into spatial images and fine-tuning multimodal models to better capture hierarchical activities and contextual events.

Contribution

The paper proposes an approach that converts long-term video sequences into the spatial image domain and fine-tunes multimodal models on the result, improving VideoQA performance on complex hierarchical activities.

Findings

01

Achieved 78.4% accuracy on the STAR task, surpassing the previous state of the art by 2.8 points.

02

Effectively captures non-continuous contextual events in videos.

03

Demonstrates the benefit of converting long-term video sequences into spatial images for VideoQA.

Abstract

Capturing complex hierarchical human activities, from atomic actions (e.g., picking up one present, moving to the sofa, unwrapping the present) to contextual events (e.g., celebrating Christmas), is crucial for achieving high-performance video question answering (VideoQA). Recent works have expanded multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences, enhancing the model's temporal reasoning capabilities. However, these approaches often fail to capture contextual events that can be decomposed into multiple atomic actions non-continuously distributed over relatively long-term sequences. In this paper, to leverage the spatial visual context representation capability of the CLIP model for obtaining non-continuous visual representations in terms of contextual events in videos, we convert long-term video sequences into a spatial image domain and finetune the multimodal…
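
The abstract excerpt does not spell out how a long video is mapped into the spatial image domain. As an illustration only, the sketch below assumes one plausible reading: sample frames evenly across the whole video and tile them into a single grid image that an image encoder such as CLIP could consume. The function names `sample_frame_indices` and `frames_to_grid`, the uniform sampling, and the row-major grid layout are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole video.

    Assumption for illustration: uniform sampling, so that atomic actions
    far apart in time can still appear together in one image.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    positions = np.linspace(0, num_frames - 1, num_samples)
    return positions.round().astype(int).tolist()


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile rows*cols frames of shape (H, W, C) into one image.

    The output has shape (rows*H, cols*W, C); frames are placed in
    row-major order (left to right, then top to bottom).
    """
    n, h, w, c = frames.shape
    if n != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {n}")
    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


if __name__ == "__main__":
    # Toy "video": 100 frames of size 4x6x3, each filled with its own index.
    video = np.broadcast_to(
        np.arange(100, dtype=np.uint8)[:, None, None, None], (100, 4, 6, 3)
    )
    idx = sample_frame_indices(len(video), 6)
    grid = frames_to_grid(video[idx], rows=2, cols=3)
    print(idx, grid.shape)
```

In this toy layout, frames that are far apart in time end up as neighboring tiles of one image, which is the intuition behind using a spatial encoder to relate non-continuous events. A real pipeline would also need frame decoding, resizing to the encoder's input resolution, and the fine-tuning step, none of which are shown here.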

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training