CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong; Weihan Wang; Ming Ding; Wenmeng Yu; Qingsong Lv; Yan Wang; Yean Cheng; Shiyu Huang; Junhui Ji; Zhao Xue; Lei Zhao; Zhuoyi Yang; Xiaotao Gu; Xiaohan Zhang; Guanyu Feng; Da Yin; Zihan Wang; Ji Qi; Xixuan Song; Peng Zhang; Debing Liu; Bin Xu; Juanzi Li; Yuxiao Dong; Jie Tang

arXiv:2408.16500 · cs.CV · August 30, 2024 · 6 citations

TL;DR

CogVLM2 is a new family of visual language models for image and video understanding. It supports input resolutions up to 1344 × 1344 pixels, accepts multi-frame video input with timestamps, and reports state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench.

Contribution

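The paper introduces the CogVLM2 family (CogVLM2, CogVLM2-Video and GLM-4V), a new generation of visual language models. For images, CogVLM2 keeps the visual expert architecture and improves the pre-training and post-training recipes. For video, CogVLM2-Video adds multi-frame input with timestamps and an automated method for constructing temporal grounding data.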

Findings

01

The CogVLM2 family achieves state-of-the-art results on benchmarks including MMBench, MM-Vet, TextVQA, MVBench and VCGBench.

02

CogVLM2 supports input resolutions up to 1344 × 1344 pixels.

03

CogVLM2-Video integrates multi-frame input with timestamps for video understanding.

Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in…
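
The abstract describes multi-frame input with timestamps but gives no implementation detail here. The following is only a minimal sketch of the general idea, not the authors' pipeline: it samples evenly spaced frames from a video and pairs each with its time offset in seconds. The names `TimedFrame`, `sample_timed_frames` and `format_prompt`, the uniform sampling strategy, and the `<frame>` placeholder are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedFrame:
    index: int        # frame index in the source video
    timestamp: float  # seconds from the start of the video


def sample_timed_frames(total_frames: int, fps: float, num_samples: int) -> list[TimedFrame]:
    """Pick evenly spaced frames and attach each one's timestamp.

    Uniform sampling is an assumption; the paper may use another strategy.
    """
    if total_frames <= 0 or fps <= 0 or num_samples <= 0:
        raise ValueError("total_frames, fps and num_samples must be positive")
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    # Split the video into equal segments and take the centre of each.
    step = total_frames / num_samples
    indices = [int(step * i + step / 2) for i in range(num_samples)]
    return [TimedFrame(index=i, timestamp=i / fps) for i in indices]


def format_prompt(frames: list[TimedFrame], question: str) -> str:
    """Interleave a hypothetical <frame> placeholder with its timestamp."""
    parts = [f"[{f.timestamp:.1f}s] <frame>" for f in frames]
    return "\n".join(parts + [question])


if __name__ == "__main__":
    # A 10-second clip at 24 fps, reduced to 4 frames.
    frames = sample_timed_frames(total_frames=240, fps=24.0, num_samples=4)
    print(format_prompt(frames, "When does the person sit down?"))
```

Keeping the timestamp next to each frame is what would let a model answer "when" questions in seconds rather than in frame positions, which is the capability temporal grounding data is meant to train.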

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Video Analysis and Summarization