OMCAT: Omni Context Aware Transformer

Arushi Goel; Karan Sapra; Matthieu Le; Rafael Valle; Andrew Tao; Bryan Catanzaro

arXiv:2410.12109·cs.CL·October 17, 2024

Open Access 3 Reviews

TL;DR

OMCAT introduces a novel transformer model with rotary time embeddings and a new dataset, OCTAV, to improve fine-grained cross-modal temporal understanding in audio-visual tasks, achieving state-of-the-art results.

Contribution

The paper presents OMCAT, a new model with rotary time embeddings, and OCTAV, a dataset for cross-modal temporal event understanding, advancing multimodal temporal reasoning.

Findings

01 OMCAT achieves state-of-the-art results on AVQA tasks.

02 The OCTAV dataset effectively captures event transitions across modalities.

03 Rotary Time Embeddings improve temporal grounding and computational efficiency.

Abstract

Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and a new model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline: feature alignment,…
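
The abstract describes RoTE only as an extension of RoPE for time-anchored tasks, and this page gives no formula. The sketch below is therefore an illustration of the general idea rather than the authors' implementation: it applies the standard RoPE pairwise rotation, but lets the rotation angle depend on a timestamp in seconds instead of an integer token position. The function names, the frequency base of 10000, and the choice of raw seconds as the coordinate are all assumptions made for this example.

```python
import numpy as np

def rotary_angles(coords, dim, base=10000.0):
    """Angles theta[i, j] = coords[i] * base**(-2j/dim) for j in [0, dim/2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # shape (dim/2,)
    return np.outer(coords, inv_freq)                 # shape (n, dim/2)

def apply_rotary(x, coords, base=10000.0):
    """Rotate consecutive feature pairs of x (n, dim) by coordinate-dependent angles.

    With coords = token indices this is standard RoPE; with coords = timestamps
    it is the time-based variant assumed in this sketch.
    """
    n, dim = x.shape
    theta = rotary_angles(np.asarray(coords, dtype=np.float64), dim, base)
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def attention_score(q, k, t_q, t_k):
    """Dot product of a query and a key after rotating each by its own timestamp."""
    q_rot = apply_rotary(q[None, :], [t_q])[0]
    k_rot = apply_rotary(k[None, :], [t_k])[0]
    return float(q_rot @ k_rot)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Two query/key pairs separated by the same 2.5 s gap at different absolute times.
score_a = attention_score(q, k, 1.0, 3.5)
score_b = attention_score(q, k, 10.0, 12.5)

# Non-uniformly sampled frames: integer positions would hide the 3 s jump.
frame_times = [0.0, 0.5, 1.0, 4.0]      # hypothetical timestamps in seconds
frame_positions = [0, 1, 2, 3]          # what plain RoPE would use instead
frames = rng.standard_normal((4, 8))
frames_time_rotated = apply_rotary(frames, frame_times)
frames_pos_rotated = apply_rotary(frames, frame_positions)
```

Under these assumptions, score_a and score_b agree up to floating-point error, because the rotated dot product depends only on the time difference between query and key. That is the property that would let a time-based rotation encode actual elapsed time between frames, which integer positions cannot do when frames are sampled unevenly.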

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper addresses a clear gap in cross-modal temporal understanding, opening new opportunities for research on integrating audio and video modalities, and provides related solutions and datasets. The approach is sound: 1. The use of RoTE (Rotary Time Embeddings) as an extension of RoPE to enhance temporal grounding is a creative advancement that could inspire further research in temporal embeddings. 2. The multi-stage training ensures thorough model development and evaluation. The experime…

Weaknesses

1. The synthetic nature of the OCTAV dataset may limit the model's performance in real-world applications. 2. The technical details of methods such as RoTE are not explained in detail, which may hinder understanding and reproducibility. 3. RoTE is claimed to better capture the actual elapsed time between frames, but the paper lacks comparisons showing how fine-grained it can be. 4. The dataset is comprehensive but also includes the evaluation dataset. This raises the concern of proposed…

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The authors' construction of OCTAV to capture the transition between audio and video is interesting and seems effective. A lot of methods in literature create videos that have time-aligned audio and video data that correspond to the same event. Here, the authors insert arbitrary audio into the video and pose new problems for training.

Weaknesses

- The method innovation is somewhat mediocre. The RoTE over RoPE seems to be a small but effective improvement. However, given the recent MLLM progress, the staged training approach seems standard. Interleaving tokens is effective, but others are also employing this technique.
- The evaluation doesn’t fully support the claims that OMCAT has achieved significant gains in temporal reasoning and cross-modal alignment. First, a lot of entries in Table 5 are missing. This is quite unfortunate to dra…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Clear writing with a logical structure. 2. Innovative ideas that incorporate the latest research findings or technologies, presenting a unique perspective. 3. Reproducible results by providing detailed experimental methods or steps, allowing other researchers to replicate the experiments.

Weaknesses

1. The experimental results are insufficient, and there is no comparison of audio understanding ability. 2. How can the quality of the generated dataset be determined? 3. In the appendix, the video understanding experiments, such as MSRVTT-QA, MSVD-QA, and ActivityNet-QA, show only average performance. 4. The comparative experiments in Table 8 are not entirely fair. The data used for OMCAT with only LP, WC, and V is 430.7k less than the total dataset, so the performance of the first row…

Taxonomy

Topics: Context-Aware Activity Recognition Systems

Methods: Attentive Walk-Aggregating Graph Neural Network