COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging (1), Mohammadreza Zolfaghari (1), Hamed Pirsiavash (2), Thomas Brox (1) ((1) University of Freiburg, (2) University of Maryland, Baltimore County)

arXiv:2011.00597 · cs.CV · November 3, 2020 · 75 cites

TL;DR

COOT is a hierarchical transformer that models multi-level semantics (frames and words, clips and sentences, videos and paragraphs) and cross-modal interactions in video-text tasks. It compares favorably to the state of the art on several benchmarks while using few parameters.

Contribution

The paper proposes a hierarchical transformer architecture that combines attention-aware feature aggregation, inter-level interaction modeling, and a cross-modal cycle-consistency loss for video-text representation learning.

Findings

01 Compares favorably to the state of the art on several benchmarks.

02 Uses fewer parameters than comparable models.

03 Demonstrates effective multi-level semantic modeling.

Abstract

Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is…
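
The abstract names a cross-modal cycle-consistency loss but does not spell it out. The following is a minimal sketch of one common formulation, not the authors' implementation: each sentence embedding finds a soft nearest neighbor among the clip embeddings, that neighbor is mapped back to a soft position among the sentences, and the loss penalizes the distance between the starting index and the position it cycles back to. The function names, the use of negative squared Euclidean distance as the similarity, and the plain-list embeddings are all assumptions made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_nearest_neighbor(query, candidates):
    """Weighted average of candidates, weighted by closeness to the query."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(query)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(sentences, clips):
    """Hypothetical sentence -> clip -> sentence cycle loss (illustrative only).

    sentences, clips: lists of embedding vectors in a shared space.
    Returns the mean squared gap between each sentence index and the
    soft index reached after cycling through the clip sequence.
    """
    total = 0.0
    for i, sentence in enumerate(sentences):
        # Step 1: soft nearest neighbor of the sentence among the clips.
        soft_clip = soft_nearest_neighbor(sentence, clips)
        # Step 2: map that soft clip back to a distribution over sentences.
        back = softmax([-sq_dist(soft_clip, s) for s in sentences])
        # Step 3: expected sentence index after the round trip.
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(sentences)


if __name__ == "__main__":
    sentences = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
    aligned_clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
    uninformative_clips = [[5.0, 5.0]] * 3
    print(cycle_consistency_loss(sentences, aligned_clips))
    print(cycle_consistency_loss(sentences, uninformative_clips))
```

Under these assumptions the loss is close to zero when clips and sentences are well aligned and grows when the clip embeddings carry no information about which sentence they belong to. A symmetric clip-to-sentence-to-clip term could be added in the same way.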

Code & Models

Repositories

gingsi/coot-videotext (PyTorch, official)

Videos

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections