OMCAT: Omni Context Aware Transformer
Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro

TL;DR
OMCAT introduces a novel transformer model with rotary time embeddings and a new dataset, OCTAV, to improve fine-grained cross-modal temporal understanding in audio-visual tasks, achieving state-of-the-art results.
Contribution
The paper presents OMCAT, a new model with rotary time embeddings, and OCTAV, a dataset for cross-modal temporal event understanding, advancing multimodal temporal reasoning.
Findings
OMCAT achieves state-of-the-art results on audio-visual question answering (AVQA) tasks.
The OCTAV dataset effectively captures event transitions across modalities.
Rotary Time Embeddings (RoTE) improve temporal grounding and computational efficiency in time-anchored tasks.
Abstract
Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline: feature alignment,…
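The abstract describes RoTE only as an extension of RoPE, and this page does not give its formula. The sketch below is therefore an illustration under a stated assumption, not the authors' implementation: it assumes the time-based variant rotates query and key vectors by an angle proportional to a frame's timestamp in seconds, where standard RoPE uses the token index. The names `rotary_angles`, `rotate`, and `score`, and the example timestamps, are hypothetical.

```python
import math

def rotary_angles(coords, dim, base=10000.0):
    """Angles coords[i] * base**(-2k/dim), one per 2-D pair k of the vector."""
    freqs = [base ** (-2.0 * k / dim) for k in range(dim // 2)]
    return [[c * f for f in freqs] for c in coords]

def rotate(vec, angles):
    """Rotate each consecutive pair (vec[2k], vec[2k+1]) by angles[k]."""
    out = []
    for k, theta in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def score(coords, i, j, q, k_vec):
    """Attention logit between a query at coords[i] and a key at coords[j]."""
    dim = len(q)
    ang = rotary_angles(coords, dim)
    q_rot = rotate(q, ang[i])
    k_rot = rotate(k_vec, ang[j])
    return sum(a * b for a, b in zip(q_rot, k_rot))

# Hypothetical input: four frames sampled at uneven timestamps (seconds).
timestamps = [0.0, 0.5, 4.0, 4.5]
positions = list(range(len(timestamps)))  # standard RoPE coordinate: token index

q = [1.0] * 8
k_vec = [1.0] * 8

# Index-based rotation: neighbouring frames always look equally far apart.
idx_near = score(positions, 0, 1, q, k_vec)
idx_far = score(positions, 1, 2, q, k_vec)

# Time-based rotation (assumed RoTE-style): the logit depends on elapsed seconds.
time_near = score(timestamps, 0, 1, q, k_vec)  # 0.5 s apart
time_far = score(timestamps, 1, 2, q, k_vec)   # 3.5 s apart
```

With index-based coordinates, `idx_near` and `idx_far` coincide even though the second pair of frames is 3.5 s apart; with timestamp coordinates the two logits differ, which is one way a model could be made sensitive to elapsed time rather than token order.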
Peer Reviews
Decision·Submitted to ICLR 2025
The paper addresses a clear gap in cross-modal temporal understanding, opening new opportunities for research on integrating audio and video modalities and providing related solutions and datasets. The approach is sound: 1. The use of RoTE (Rotary Time Embeddings) as an extension of RoPE to enhance temporal grounding is a creative advancement that could inspire further research in temporal embeddings. 2. The multi-stage training ensures thorough model development and evaluation. The experime
1. The synthetic nature of the OCTAV dataset may limit the performance of the model in real-world applications. 2. The technical details of methods such as RoTE are not explained in detail, which may affect understanding and reproducibility. 3. RoTE is claimed to better capture the actual elapsed time between frames, but the paper lacks comparisons showing how fine-grained it can be. 4. The dataset is comprehensive but also includes the evaluation dataset. This raises the concern of proposed
- The authors' construction of OCTAV to capture the transition between audio and video is interesting and seems effective. A lot of methods in literature create videos that have time-aligned audio and video data that correspond to the same event. Here, the authors insert arbitrary audio into the video and pose new problems for training.
- The methodological innovation is somewhat mediocre. RoTE seems to be a small but effective improvement over RoPE. However, given recent MLLM progress, the staged training approach seems standard. Interleaving tokens is effective, but others also employ this technique.
- The evaluation doesn’t fully support the claims that OMCAT has achieved significant gains in temporal reasoning and cross-modal alignment. First, a lot of entries in Table 5 are missing. This is quite unfortunate to dra
1. Clear writing with a logical structure. 2. Innovative ideas that incorporate the latest research findings or technologies, presenting a unique perspective. 3. Reproducible results by providing detailed experimental methods or steps, allowing other researchers to replicate the experiments.
1. The experimental results are insufficient, and there is no comparison regarding the understanding ability of audio. 2. How can the quality of the generated dataset be determined? 3. In the appendix, the video understanding experiments, such as MSRVTT-QA, MSVD-QA, and ActivityNet-QA, show average performance. 4. The comparative experiments in Table 8 are not entirely fair. The data used for OMCAT with only LP, WC, and V is 430.7k less than the total dataset, so the performance of the first row
Taxonomy
Topics·Context-Aware Activity Recognition Systems
Methods·Attentive Walk-Aggregating Graph Neural Network
