GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

GLM-V Team: Wenyi Hong; Xiaotao Gu; Ziyang Pan; Zhen Yang; Yuting Wang; Yue Wang; Yuanchang Yue; Yu Wang; Yanling Wang; Yan Wang; Xijun Liu; Wenmeng Yu; Weihan Wang; Wei Li; Shuaiqi Duan; Sheng Yang; Ruiliang Lv; Mingdao Liu; Lihang Pan; Ke Ning; Junhui Ji; Jinjiang Wang; Jing Chen; Jiazheng Xu; Jiale Zhu; Jiale Cheng; Ji Qi; Guobing Gan; Guo Wang; Cong Yao; Zijun Dou; Zihao Zhou; Zihan Wang; Zhiqi Ge; Zhijie Li; Zhenyu Hou; Zhao Xue; Zehui Wang; Zehan Qi; Zehai He; Yutao Zhang; Yusen Liu; Yukuo Cen; Yuchen Li; Yuan Wang; Yu Yang; Yongbin Liu; Yijian Lu; Yifan Xu; Yanzi Wang; Yanxiao Zhao; Yanfeng Wang; Yadong Xue; Yabo Xu; Xinyu Zhang; Xinyu Liu; Xiao Liu; Wenyi Zhao; Wenkai Li; Tianyu Tong; Tianshu Zhang; Shudan Zhang; Shengdong Yan; Qinkai Zheng; Mingde Xu; Licheng Bao; lat Long long; Jiaxing Xu; Jiaxin Fan; Jiawen Qian; Jiali Chen; Jiahui Lin; Jiadai Sun; Haozhi Zheng; Haoran Wang; Haochen Li; Hanyu Lai; Han Xu; Fan Yang; Dan Zhang; Da Yin; Chuangxin Zhao; Chengcheng Wu; Boyan Shi; Bowen Lv; Bowei Jia; Bo Li; Bin Chen; Baoxu Wang; Peng Zhang; Debing Liu; Bin Xu; Juanzi Li; Minlie Huang; Yuxiao Dong; Jie Tang

arXiv:2604.26752·cs.CV·May 13, 2026

TL;DR

GLM-5V-Turbo is a multimodal foundation model designed for agents that integrates perception with reasoning, planning, and tool use, enabling effective handling of heterogeneous contexts like images and videos.

Contribution

It introduces a model that embeds multimodal perception into core reasoning and agentic tasks, advancing the development of native multimodal foundation models for agents.

Findings

01 Strong performance in multimodal coding and visual tool use.

02 Effective integration of perception with reasoning and planning.

03 Maintains competitive text-only coding capabilities.

Abstract

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive…

Code & Models

Repositories

zai-org/GLM-V (GitHub)
