Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer

Min Peng; Chongyang Wang; Yu Shi; Xiang-Dong Zhou

arXiv:2302.02136 · cs.CV · March 7, 2023 · 1 citation

TL;DR

This paper introduces a pyramidal multimodal transformer for end-to-end VideoQA that efficiently models multi-scale video-language interactions without relying on large pre-trained feature extractors, achieving competitive results.

Contribution

The paper proposes a novel pyramidal multimodal transformer with anisotropic pyramid structures and scale-specific interactions for efficient VideoQA.

Findings

01 Achieves performance comparable to or better than state-of-the-art methods.

02 Demonstrates high computational efficiency and scalability.

03 Is also effective on text-to-video retrieval tasks.

Abstract

This paper presents a new method for end-to-end Video Question Answering (VideoQA), as an alternative to the currently popular approach of large-scale pre-training with huge feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a few convolutional and transformer layers. We use an anisotropic pyramid to carry out video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, novel strategies are proposed to decompose the visual feature stream into spatial and temporal sub-streams at different scales and to implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performances with high…
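
The "canonical pyramid" mentioned in the abstract (a bottom-up pathway, a top-down pathway, and lateral connections) can be illustrated with a minimal NumPy sketch over a sequence of per-frame features. This is not the PMT model: it omits the spatial/temporal decomposition, the language interaction, and all learned layers. The function names `bottom_up` and `top_down`, the pooling factor of 2, and the use of average pooling and additive fusion are assumptions made only for illustration.

```python
import numpy as np


def bottom_up(features, num_levels):
    """Build coarser temporal scales by averaging adjacent frame pairs.

    features: array of shape (T, D), one D-dim vector per frame (assumed layout).
    Returns a list of arrays, finest scale first.
    """
    levels = [features]
    for _ in range(num_levels - 1):
        x = levels[-1]
        pairs = x.shape[0] // 2  # a trailing odd frame is dropped
        pooled = x[: pairs * 2].reshape(pairs, 2, x.shape[1]).mean(axis=1)
        levels.append(pooled)
    return levels


def top_down(levels):
    """Fuse coarse scales into finer ones via upsampling plus lateral addition."""
    fused = [levels[-1]]  # start from the coarsest scale
    for lateral in reversed(levels[:-1]):
        # Nearest-neighbour upsampling along time: repeat each coarse step twice.
        up = np.repeat(fused[0], 2, axis=0)
        if up.shape[0] < lateral.shape[0]:
            # Pad by one step when the finer level had an odd length.
            up = np.concatenate([up, up[-1:]], axis=0)
        # Lateral connection: add the same-scale bottom-up feature.
        fused.insert(0, lateral + up[: lateral.shape[0]])
    return fused


if __name__ == "__main__":
    frames = np.random.rand(16, 8)  # 16 frames, 8-dim features (toy sizes)
    pyramid = top_down(bottom_up(frames, num_levels=3))
    print([level.shape for level in pyramid])
```

In a learned model, the pooling, upsampling, and addition steps would typically be replaced by trainable layers; the sketch only shows how information flows between scales.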

Code & Models

Repositories

trunpm/pmt-aaai23 (PyTorch, official)

Videos

Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning