OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

Tiancheng Zhao; Qianqian Zhang; Kyusong Lee; Peng Liu; Lu Zhang; Chunxin Fang; Jiajia Liao; Kelei Jiang; Yibo Ma; Ruochen Xu

arXiv:2407.04923·cs.CV·July 9, 2024

TL;DR

OmChat is a multimodal language model for long-context and video understanding. It combines an architecture that standardizes how visual inputs are processed, dynamic vision encoding, and active progressive pretraining, and it outperforms most open-source models on multi-image and video benchmarks.

Contribution

The paper introduces OmChat, a multimodal model that supports context lengths of up to 512K, standardizes how different visual inputs are processed, and uses an active progressive pretraining strategy to improve long-context and video understanding.

Findings

01. Supports context lengths of up to 512K for complex tasks.
02. Outperforms most open-source models in multimodal benchmarks.
03. Proposes a new benchmark dataset for temporal visual understanding.

Abstract

We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a…
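The abstract does not spell out how dynamic vision encoding handles images of different resolutions. One common approach in multimodal models, shown here only as an illustrative sketch and not as OmChat's confirmed method, is to split each image into a grid of fixed-size tiles whose layout matches the image's aspect ratio, so larger images yield more visual tokens. The function names, the 336-pixel tile size, the cap of 9 tiles, and the 576 tokens per tile are all assumptions made for this example.

```python
import math


def choose_tile_grid(width, height, tile=336, max_tiles=9):
    """Pick a (cols, rows) tile grid for an image of the given pixel size.

    Hypothetical helper: tile size and tile cap are illustrative assumptions,
    not values taken from the OmChat paper.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    target_ratio = width / height
    # Number of tiles needed to cover the image at native resolution, capped.
    needed = min(max_tiles, math.ceil(width / tile) * math.ceil(height / tile))

    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Compare aspect ratios in log space so wide and tall
            # mismatches are penalised symmetrically.
            ratio_error = abs(math.log((cols / rows) / target_ratio))
            # Break ties by how close the grid is to the needed tile count.
            key = (round(ratio_error, 9), abs(cols * rows - needed))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


def visual_token_count(cols, rows, tokens_per_tile=576, with_thumbnail=True):
    """Count visual tokens for a grid, optionally adding one global thumbnail tile."""
    tiles = cols * rows + (1 if with_thumbnail else 0)
    return tiles * tokens_per_tile


if __name__ == "__main__":
    for size in [(336, 336), (672, 672), (1344, 336)]:
        grid = choose_tile_grid(*size)
        print(size, grid, visual_token_count(*grid))
```

Under these assumptions, a small square image maps to a single tile while a wide panorama maps to a row of tiles, which is why the visual token budget grows with resolution and why long context support matters for multi-image and video inputs.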

Code & Models

Models

Hugging Face: omlab/omchat-v2.0-13B-single-beta_hf (model, 19 downloads, 5 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications