OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Lu Zhang; Tiancheng Zhao; Heting Ying; Yibo Ma; Kyusong Lee

arXiv:2406.16620·cs.CV·November 13, 2024

TL;DR

OmAgent introduces a multi-modal framework that efficiently processes and understands complex videos by combining intelligent frame retrieval, divide-and-conquer reasoning, and autonomous tool invocation, significantly improving video comprehension accuracy.

Contribution

It presents OmAgent, a novel system that enhances video understanding through efficient data management, autonomous reasoning, and dynamic API integration, addressing limitations of traditional methods.

Findings

1. OmAgent effectively handles 24-hour CCTV footage and full-length films.
2. It significantly reduces information loss compared to traditional frame extraction methods.
3. Experimental results demonstrate improved accuracy and robustness in complex video tasks.

Abstract

Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm…
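
The abstract names two mechanisms: query-driven retrieval of stored video content, and a divide-and-conquer loop that breaks a complex task into simpler ones. The sketch below is only a toy illustration of that general pattern, not OmAgent's implementation. Every name in it (Segment, SegmentStore, solve) is hypothetical, captions stand in for real frame features, word overlap stands in for a real retriever, and splitting on " and " stands in for LLM-driven task decomposition.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A stored video segment (hypothetical); the caption stands in for frame features."""
    start_s: float
    end_s: float
    caption: str


class SegmentStore:
    """Toy store that ranks segments by word overlap with the query."""

    def __init__(self) -> None:
        self._segments: List[Segment] = []

    def add(self, segment: Segment) -> None:
        self._segments.append(segment)

    def retrieve(self, query: str, top_k: int = 1) -> List[Segment]:
        words = set(query.lower().split())
        scored = [
            (len(words & set(s.caption.lower().split())), s)
            for s in self._segments
        ]
        # Keep only segments that share at least one word with the query.
        hits = [(score, s) for score, s in scored if score > 0]
        # Stable sort: ties keep insertion (chronological) order.
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in hits[:top_k]]


def solve(
    task: str,
    is_simple: Callable[[str], bool],
    split: Callable[[str], List[str]],
    execute: Callable[[str], List[str]],
    depth: int = 0,
    max_depth: int = 3,
) -> List[str]:
    """Recursively divide a task until it is simple, then execute and collect results."""
    if depth >= max_depth or is_simple(task):
        return execute(task)
    results: List[str] = []
    for sub in split(task):
        results.extend(solve(sub, is_simple, split, execute, depth + 1, max_depth))
    return results


store = SegmentStore()
store.add(Segment(0, 10, "a man enters the lobby"))
store.add(Segment(10, 20, "a car leaves the garage"))
store.add(Segment(20, 30, "empty corridor"))


def execute(task: str) -> List[str]:
    # The "tool call" here is just a retrieval against the toy store.
    return [s.caption for s in store.retrieve(task)]


answers = solve(
    "who enters the lobby and who leaves the garage",
    is_simple=lambda t: " and " not in t,
    split=lambda t: t.split(" and "),
    execute=execute,
)
```

In this toy setup the compound question is split into two sub-questions, each answered by its own retrieval, so each sub-question can reach a different part of the video. The max_depth parameter is an assumed safeguard against unbounded recursion.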

Code & Models

Models

omlab/omchat-v2.0-13B-single-beta_hf (Hugging Face model; 19 downloads, 5 likes)

Videos

OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Image Retrieval and Classification Techniques