CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

TL;DR
CogCoM introduces a step-by-step visual reasoning mechanism inspired by human cognition, enabling large models to solve visual problems with interpretability and state-of-the-art accuracy across multiple benchmarks.
Contribution
The paper proposes Chain of Manipulations, a novel reasoning mechanism for vision-language models, along with a flexible design, data pipeline, and training process for versatile visual reasoning.
Findings
Achieves state-of-the-art results on 9 benchmarks
Demonstrates effective multi-turn multi-image reasoning
Provides interpretable reasoning traces
Abstract
Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures in meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zoom in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence. After training, models can solve various visual problems by eliciting intrinsic manipulations (e.g., grounding, zoom in) with results (e.g., boxes, image) actively without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations upon…
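As a rough illustration of the mechanism the abstract describes, the sketch below records a chain of two manipulations (grounding, then zoom in) together with their results (a box, then an image), so the intermediate evidence can be inspected afterwards. It is a toy under stated assumptions and not the paper's implementation: the names `Step`, `crop`, `zoom_in`, and `run_chain` are hypothetical, the image is a small grid of integers, and the grounding box is supplied by hand, whereas in CogCoM the model itself produces it.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical conventions for this sketch only.
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive
Image = List[List[int]]           # toy grayscale image: rows of pixel values


@dataclass
class Step:
    manipulation: str   # e.g. "grounding" or "zoom_in"
    note: str           # human-readable reason for the step
    result: object      # evidence produced by the step (a box or an image)


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def zoom_in(image: Image, box: Box, factor: int = 2) -> Image:
    """Crop to box, then upscale by repeating each pixel (nearest neighbour)."""
    zoomed: Image = []
    for row in crop(image, box):
        wide = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


def run_chain(image: Image, box: Box, factor: int = 2) -> List[Step]:
    """Build a two-step trace: a grounding result followed by a zoomed view."""
    trace = [Step("grounding", "locate the region of interest", box)]
    trace.append(Step("zoom_in", "enlarge the grounded region",
                      zoom_in(image, box, factor)))
    return trace


toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
trace = run_chain(toy_image, (1, 1, 3, 3))
for step in trace:
    print(step.manipulation, "->", step.result)
```

Keeping each step's result in the trace is what makes the chain inspectable: if the final answer is wrong, a reader can check whether the box or the zoomed view was already off.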
Peer Reviews
Decision·ICLR 2025 Poster
* The paper shows promising results. * The paper is well written and easy to understand. * The paper includes extensive qualitative examples. * The proposed CoM dataset has the potential to be useful to the broader community.
* Novelty: The core idea of the paper is very similar to that of "Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models, CVPR 2024" and "Visual Programming: Compositional Visual Reasoning Without Training, CVPR 2023". The Visual Program Distillation work trains the VLM to perform several grounding tasks for more accurate visual reasoning. Somewhat similarly, "Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models, arX
1. The proposed method, defining image-manipulation operations to solve queries that require a detailed look, is intuitive and clear. 2. Intensive experimental results compared with up-to-date models are shown on various tasks, leading to SoTA performance. 3. The writing is clear and easy to follow.
1. One thing is not very clear: at inference time, once the model is trained, is step-by-step CoM still needed? In other words, is CoM a method for collecting training data, or is it used for VLM inference as well? Fig-6 right breaks down the questions into eight groups based on the time overhead. Is this the inference time of a single VLM call, or are multiple calls using CoM needed? 2. In principle, how does this work compare to visual programming [1,2]? Are the defined manipulations like Zoo
1 - The paper's contributions are very sound. The idea of collecting data automatically, but having a way to pseudo-verify its correctness (at least end-to-end correctness) is reasonable. 2 - The paper shows good results across a large variety of benchmarks. 3 - The paper additionally contributes extra human-annotated data for mathematical visual reasoning, which can be useful for the community.
1 - From a scientific point of view, it is hard to tell how significant the contributions are. There is one ablation (which is informative) where they remove the multi-turn training data from their training mix, and it seems like it helps overall (but very little for two out of the three tasks shown). However, it is unclear how much of the contribution comes from the single-turn vs. multi-turn setting, or the data, or the combination of both, or just any other modeling decision when training the
Taxonomy
Topics: Multimodal Machine Learning Applications
