OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
Tiancheng Zhao, Qianqian Zhang, Kyusong Lee, Peng Liu, Lu Zhang, Chunxin Fang, Jiajia Liao, Kelei Jiang, Yibo Ma, Ruochen Xu

TL;DR
OmChat is a multimodal language model capable of understanding long contexts and videos, utilizing a novel architecture, dynamic vision encoding, and progressive pretraining to outperform existing models in complex visual tasks.
Contribution
The paper introduces OmChat, a new multimodal model that supports a context length of up to 512K, uses a standardized architecture for processing visual inputs, and applies an active progressive pretraining strategy to improve long-context and video understanding.
Findings
Supports context length up to 512K for complex tasks.
Outperforms most open-source models in multimodal benchmarks.
Proposes a new benchmark dataset for temporal visual understanding.
Abstract
We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a…
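The progressive strategy described in the abstract, where the supported context length grows gradually during pretraining, can be illustrated with a small scheduling sketch. This is only an illustration under stated assumptions: the function name `progressive_context_schedule`, the 4K starting length, the growth factor, and reading "512K" as 512 × 1024 tokens are all hypothetical choices and are not taken from the paper.

```python
def progressive_context_schedule(start_len: int, max_len: int, factor: int = 2) -> list[int]:
    """Return a list of context lengths that grows geometrically up to max_len.

    Hypothetical helper: the stage boundaries used by OmChat are not given
    in the abstract, so the values here are illustrative assumptions only.
    """
    if start_len <= 0 or max_len < start_len or factor < 2:
        raise ValueError("need 0 < start_len <= max_len and factor >= 2")

    lengths = []
    current = start_len
    # Add each intermediate stage, multiplying the context length every step.
    while current < max_len:
        lengths.append(current)
        current *= factor
    # Always finish exactly at the target maximum length.
    lengths.append(max_len)
    return lengths


K = 1024  # assumption: "512K" is read as 512 * 1024 tokens
schedule = progressive_context_schedule(4 * K, 512 * K)
for stage, length in enumerate(schedule, start=1):
    print(f"stage {stage}: train with context length {length // K}K")
```

Each stage would then be trained on data packed to that stage's length before moving on to the next one. How many stages to use and how much data each stage sees are design choices that the abstract does not specify.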
Taxonomy
Topics: Multimodal Machine Learning Applications
