CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Ji Qi; Ming Ding; Weihan Wang; Yushi Bai; Qingsong Lv; Wenyi Hong; Bin Xu; Lei Hou; Juanzi Li; Yuxiao Dong; Jie Tang

arXiv:2402.04236·cs.CV·March 4, 2025·2 cites

TL;DR

CogCoM introduces a step-by-step visual reasoning mechanism inspired by human cognition, enabling large models to solve visual problems with interpretability and state-of-the-art accuracy across multiple benchmarks.

Contribution

The paper proposes Chain of Manipulations, a novel reasoning mechanism for vision-language models, along with a flexible design, data pipeline, and training process for versatile visual reasoning.

Findings

01. Achieves state-of-the-art results on 9 benchmarks
02. Demonstrates effective multi-turn multi-image reasoning
03. Provides interpretable reasoning traces

Abstract

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon…
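
The idea of solving a problem as a sequence of manipulations, each leaving behind evidence such as a box or a new image, can be sketched as a simple trace structure. This is only an illustration under stated assumptions: in the paper the trained model produces the manipulations itself without external tools, whereas the functions below (`grounding`, `crop_and_zoom`, `run_chain`) and the `Step` record are hypothetical toy stand-ins operating on a tiny grayscale grid, not CogCoM's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class Step:
    """One link in the chain: which manipulation ran, on what, and its evidence."""
    manipulation: str  # hypothetical names: "grounding", "crop_and_zoom"
    argument: object   # input to the manipulation
    result: object     # evidence produced (a box or an image)


def grounding(image: Image, target: int) -> Box:
    """Toy stand-in for grounding: tight box around pixels equal to `target`."""
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value == target]
    if not coords:
        raise ValueError("target not found in image")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def crop_and_zoom(image: Image, box: Box, factor: int = 2) -> Image:
    """Crop the box, then enlarge it by nearest-neighbour pixel repetition."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in image[y0:y1]]
    zoomed: Image = []
    for row in cropped:
        wide = [value for value in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


def run_chain(image: Image, target: int) -> List[Step]:
    """Run grounding then zoom, recording every step so errors can be traced."""
    trace: List[Step] = []
    box = grounding(image, target)
    trace.append(Step("grounding", target, box))
    zoomed = crop_and_zoom(image, box, factor=2)
    trace.append(Step("crop_and_zoom", box, zoomed))
    return trace


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0],
        [0, 7, 7, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    for step in run_chain(image, target=7):
        print(step.manipulation, "->", step.result)
```

Because each step stores its input and its result, a wrong final answer can be traced to the step whose evidence was wrong (for example, a misplaced box), which is the kind of error tracing the abstract describes.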

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

* The paper shows promising results.
* The paper is well written and easy to understand.
* The paper includes extensive qualitative examples.
* The proposed CoM dataset has the potential to be useful to the broader community.

Weaknesses

* Novelty: The core idea of the paper is very similar to that of "Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models, CVPR 2024" and "Visual Programming: Compositional Visual Reasoning Without Training, CVPR 2023". The Visual Program Distillation work trains the VLM to perform several grounding tasks for more accurate visual reasoning. Somewhat similarly, "Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models, arX

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The proposed method, defining image-manipulation operations to solve queries that require a detailed look, is intuitive and clear.
2. Extensive experimental results against up-to-date models are shown on various tasks, leading to SoTA performance.
3. The writing is clear and easy to follow.

Weaknesses

1. One thing is not very clear: at inference time, once the model is trained, is step-by-step CoM still needed? In other words, is CoM a method for collecting training data, or is it for VLM inference as well? Fig-6 right breaks down the questions into eight groups based on the time overhead; is this the inference time of a single VLM call, or are multiple calls needed when using CoM?
2. In principle, how does this work compare to visual programming [1,2]? Is the defined manipulations like Zoo

Reviewer 03·Rating 6·Confidence 4

Strengths

1 - The paper's contributions are very sound. The idea of collecting data automatically, but having a way to pseudo-verify its correctness (at least end-to-end correctness) is reasonable.
2 - The paper shows good results across a large variety of benchmarks.
3 - The paper additionally contributes extra human-annotated data for mathematical visual reasoning, which can be useful for the community.

Weaknesses

1 - From a scientific point of view, it is hard to tell how significant the contributions are. There is one ablation (which is informative) where they remove the multi-turn training data from their training mix, and it seems like it helps overall (but very little for two out of the three tasks shown). However, it is unclear how much of the contribution comes from the single-turn vs. multi-turn setting, or the data, or the combination of both, or just any other modeling decision when training the

Code & Models

Repositories

thudm/cogcom
pytorch·Official

Datasets

qijimrc/CoMDataset
dataset·204 dl

Taxonomy

Topics·Multimodal Machine Learning Applications